The FastHCS Algorithm for Robust PCA
Principal component analysis (PCA) is widely used to analyze high-dimensional data, but it is very sensitive to outliers. Robust PCA methods seek fits that are unaffected by the outliers and can therefore be trusted to reveal them. FastHCS (High-dimensional Congruent Subsets) is a robust PCA algorithm suitable for high-dimensional applications, including cases where the number of variables exceeds the number of observations. After detailing the FastHCS algorithm, we carry out an extensive simulation study and three real data applications, the results of which show that FastHCS is systematically more robust to outliers than state-of-the-art methods.
Keywords: High-dimensional data, outlier detection, computational statistics, exploratory data analysis
Principal component analysis (PCA) is widely used to explore high-dimensional data. It centers and rotates the original -dimensional measurements to construct a small number of new orthonormal variables, called principal components, that account for most of the variation in the data. However, classical PCA is very sensitive to outliers. Outliers are observations that are inconsistent with the multivariate pattern of the majority of the data. If left unchecked, they influence the estimated parameters by disproportionately pulling the fit towards themselves. In this way, outliers obscure the main relationships in the data and their true outlyingness. In practice, we want to find the outliers to bound their influence on the fit and to study as objects of interest in their own right. For these reasons, we need robust PCA methods that meet the following criteria: (1) Like classical PCA, a robust PCA method should handle cases where the number of variables exceeds the number of observations, (2) and it should be shift and rotation equivariant, meaning that if the data are shifted or rotated the estimated parameters should transform accordingly. (3) It should be computable for high-dimensional data. (4) It should accurately describe the multivariate pattern of the majority of the observations, even when the data is heavily contaminated by outliers. (5) It should have a high breakdown point; a measure an estimator’s robustness to outliers in the data. (6) It should be insensitive to the dimensionality of the data. Criteria (1)-(3) are natural for any PCA method. Criteria (4)-(5) relate to robustness. Criterion (6) is related to both concerns. We find that state-of-the-art robust PCA algorithms have most of these properties, but that, surprisingly, many instances can be found where they fail to satisfy Criterion (4). In this paper, we introduce a robust PCA algorithm, FastHCS, to meet these criteria (HCS for high-dimensional congruent subset). In the next section we outline FastHCS. Then, in Sections (3) and (4) we compare it to several state-of-the-art methods on simulated data and three real data applications which show that in many settings only FastHCS can be relied upon to provide a robust PCA solution.
Given an data matrix and for a fixed , the FastHCS algorithm searches for a subset of size at least free of outliers (this is the minimal value of such that there are at least clean observations in each candidate subset).
If , FastHCS computes the mean-centered data matrix , and performs the kernel eigenvalue decomposition of where is an matrix, is an matrix, and is a diagonal matrix with the eigenvalues on the diagonal. Then, FastHCS works with the matrix . The transformation from to causes no loss of information or robustness since we retain all of the components corresponding to non-zero eigenvalues. However, this transformation reduces the computational cost of the subsequent steps of the algorithm. At the end of the algorithm, FastHCS reverses these transformations so that the returned parameter estimates are consistent with conventional PCA. When , we simply set .
2.1 The -index -subset
The -index is a subset selection criterion first introduced in Vakili and Schmitt (2014) where it is used to identify an outlier free subset to serve as the basis of the robust PCS location and scatter estimator. The -index was designed to be insensitive to the configuration of the outliers and consequently, as we show in that article, the fit found by FastPCS is nearly unaffected by the presence of outliers in the data (we refer to this as quantitative robustness). In (Schmitt et al., 2014) we further show that the PCS estimates also have the maximum possible breakdown point (we refer to this as qualitative robustness). Robust location and scatter estimation are also important for PCA. In the PCS context, the -index is applied to the observations in their original dimensionality, and one approach to achieving robust PCA would be to use the robust PCS covariance estimate as a starting point for PCA. However, this approach does not satisfy Criterion (3) for robust PCA since it is not possible to perform PCS when the number of dimensions is greater than the number of observations. This subsection describes how the -index can be extended to the PCA context by applying it to projections of the data on to subspaces.
To begin, FastHCS draws random subsets of size from without replacement, where is given by:
and where is an integer specifying the number of uncontaminated observations, so that the probability of getting at least one uncontaminated starting subset is at least 99% (Stahel, 1981). By default we set . However, if the user is sure that the contamination rate of the sample is lower than , we offer the possibility (as in Maronna and Yohai (1995)) of using this information to reduce the computational cost of running FastHCS. Denote these -subsets as . The SVD decomposition of the observations indexed by is:
where is the estimated center, is a diagonal matrix for which the non-zero elements are the descending eigenvalues of the PCA model fited to , and the eigenvectors are the first loadings of this model. Next, we compute the score matrix with rows :
which is the projection of the re-centered rows of on to the subspace spanned by the first loadings of . To measure the outlyingness of an to the members of , we will use its squared orthogonal distance to , the direction normal to the hyperplane through members of drawn at random:
and, to remove the dependence of this measure on the direction , we average it over such directions :
(In Remark 1 below we discuss how we set the value of the parameter ). The denominator in Equation (2) normalizes these distances across the directions .
We can now describe the computation of the first step of FastHCS. For a given a -subset of and its corresponding matrix , Algorithm 1 returns and -subset of indexes of using an iterative procedure we call growing steps. In each step , is updated and contains the indexes of the observations with smallest values of . The value of itself increases incrementally from to in steps. These steps do not have a convergence criterion, so the number of iterations must be set in advance (In Remark 1 below we discuss how we set the value of the parameter ).
|Algorithm 1: growingStep|
for to do: set set end for
After growing candidate ’s, FastHCS evaluates each using a criterion we call the -index, and fits a robust PCA model to the having smallest value of the -index. For a given -subset and direction , we define a subset that is optimal with respect to in the sense that it indexes the observations with the smallest values of . More precisely, denoting the order statistic of a vector , we have:
Then, we define the -index of an along as
with the convention that . The measure is always positive and increases the fewer members shares with along the direction . This is because, for a given direction , the members of not in will decrease the denominator in Equation (3) without affecting the numerator, increasing the overall ratio. As in the growing steps, we remove the dependence of Equation (3) on the directions by considering the average over directions:
Finally, FastHCS selects as the candidate -subset with lowest -index.
Given , we denote the PCA parameters corresponding to as and obtain them as follows:
where . FastHCS computes these parameters on the full space of the data set, , rather than on the space of , to increase their accuracy. Algorithm 2 summarizes the I-index step of FastHCS.
|Algorithm 2: IStep|
for to do: (q+1) end for return
Through experiments, we find that increasing above 25 or above 5 does not noticeably improve the performance of the algorithm (though it increases its computational cost), so we set these parameters to those values. Those experiments were carried on the outlier configurations discussed in Sections 3 and 4 as well as additional configurations enabled by the simulation suite provided with the Online Resources (Section 4). Because such experiments cannot cover all possible configurations of outliers, we focused on those configurations singled out as most challenging in the literature on robust PCA.
Exact fit: When the members of a subset lie on a subspace with the numerator and denominator of will be the same for any direction through members of (Schmitt et al., 2014) so that . Then, and will correspond with . In such situations, FastHCS will return the index of the members of . This behavior is called exact fit (Maronna et al., 2006).
Breakdown point: The (finite sample) breakdown point of an estimator referred to in Criterion (5) is the smallest proportion of observations that need to be replaced by arbitrary value to drive the estimates to arbitrary values (Donoho, 1982). Naturally, a higher breakdown point is better, and the maximal breakdown point achievable in the PCA context is essentially fifty percent.
Both the growing step and the I-index use distances computed on subspaces to derive a measure of outlyingness. Since they are restricted to this view of the data, they are vulnerable to outliers that appear consistent with the majority on a subspace, but are outlying with respect to it (Appendix A details the specific configurations of outliers giving rise to this issue). Therefore, fits based on alone will not have maximum breakdown and the procedure presented above must be combined with a second, computational expedient, ancillary procedure to ensure that the final FastHCS estimates do.
2.2 The Projection Pursuit -subset
Although experiments, such as those in Sections 3 and 4, show that the -index rarely selects contaminated subsets, it is vulnerable to specific configurations of outliers (see Appendix A). To guard against these, FastHCS uses a robust projection-pursuit (PP) approach to identify a second subset of observations, . The PP approach proceeds by assigning an outlyingness score to each observation:
where contains 1000 directions , each given by two data points drawn randomly from the sample, is the median of and . The PP method is orthogonaly invariant and computationally expedient. A version of the PP algorithm is used as an initial step in ROBPCA (Hubert et al., 2005; Debruyne and Hubert, 2009), a popular robust PCA algorithm.
2.3 Selecting the final PCA model
Consider the subset . Because , it holds that and is free of outliers whenever either one of or is. We propose to exploit this fact to select between the I-index and PP-based models. Denote and
with the assumption that . When (or if ) the final FastHCS parameters will be equal to and the final FastHCS subset is set as . As we show in Appendix B, this selection rule ensures that the FastHCS fit has a high breakdown point. Our approach is similar to the ROBPCA algorithm which also selects from among two candidate subsets in the final stage of the algorithm. ROBPCA selects the subset whose eigenvalues have the smallest product. However, depending on the configuration of the outliers and the rate of contamination, it is possible for a contaminated subset to have smaller eigenvalues than an uncontaminated one (Schmitt et al., 2014), and so to end up being selected by the criterion used in ROBPCA. In contrast, the selection criterion we propose controls (through the denominators in Equation (6)) for the relative scatter of the two subsets and so it is not biased towards subsets containing many concentrated outliers. Naturally, criterion (6) is designed to favor the -index based model whenever doing so does not cause breakdown.
2.4 Outlyingness to the PCA model
Two concepts of distance are used to assess the outlyingness of an observation with respect to a PCA model, and cut-off values for both of these can be used to classify outliers (Hubert et al., 2005). The first is the orthogonal distance () of the observation to the PCA model space:
Assuming multivariate normality of the observations on which the PCA model is fitted, a cut-off can be obtained for the statistics using the Wilson-Hilferty transformation of the s into approximately normally distributed random variables:
where is the quantile of the distribution with one degree of freedom, and indexes a subset of observations. The second measure of outlyingness is the score distance () of the observation on the PCA model space:
A 97.5% cut-off for the statistics can be obtained using a distribution.
In inferential applications, PCA theory typically assumes multivariate normality, though ellipticity is sufficient for many of the hypotheses of interest to PCA-based inference, see (Jensen, 1986) and (Jolliffe, 2002, pages 49,55,394). In any case, robust PCA performs inference with a model fitted on the non-outlying observations, so the distributional assumption pertains to only this subset of the data. Conversely, no assumptions are made on the distribution(s) of the outliers.
2.5 Computational considerations
The computational complexity of FastHCS is determined by the -index and PP subset selection components. The complexity of the PP-based approach is . This is dominated by the time complexity of the -index, which scales as for each starting -subset. Except when and are small (smaller than about 5 and 2000 in our tests) FastHCS is not the quickest of the robust PCA methods we considered (in general, we find that PcaL is). The ’Fast’ qualification in this context (”FastHCS”) is used to distinguish the algorithm based on random sub-sampling from the naïve one based on exhaustive enumeration of all possible starting points, the latter being usually not computable in practice. The computing time of FastHCS grows similarly in to other methods we discuss in this paper, while it is the most sensitive to increases in , with computation times being comparable until around . For higher , FastHCS is the slowest overall. In practice, FastHCS becomes impractical for values of much larger than 25. Nevertheless, the overall time complexity of FastHCS grows with , instead of , making it a suitable candidate for high-dimensional applications, and satisfying Criterion (3) for a robust PCA method. Moreover, FastHCS belongs to the class of so called ‘embarrassingly parallel’ algorithms, i.e. its time complexity scales as the inverse of the number of processors, meaning it is well suited to benefit from modern computing environments. To enhance user experience, we implemented FastHCS in C ++ and wrapped it in an portable, open source R package (R Core Team, 2012) distributed through CRAN (package FastHCS)
3 Simulation Study
In this section we evaluate the behavior of FastHCS against three other robust PCA methods: the ROBPCA (Hubert et al., 2005), PcaPP (Croux and Ruiz-Gazen, 2005) and PcaL (Locantore et al., 1999) algorithms. Although other methods for high-dimensional outlier detection exist, these are particularly comparable with FastHCS: all three are PCA algorithms, satisfying Criteria (1)-(3) of a robust PCA method. ROBPCA, PcaPP and PcaL were computed using the R (R Core Team, 2012) package rrcov (Todorov and Filzmoser, 2009) with default settings except for the robustness parameter alpha for ROBPCA which we set to 0.5, the value yielding maximum robustness and the value of k which we set to for all the algorithms. Our evaluation criterion is the empirical bias, a quantitative measure of robustness of a fit. In Appendix C we explain the motivation for this choice (in the Online Resources we also report the results obtained using alternative evaluation criteria).
3.1 Empirical bias
Given an elliptical distribution with location vector and covariance matrix (the superscript stands for uncontaminated) and an arbitrary distribution (the superscript stands for contaminated), consider the -contaminated model
For a fixed denote the rank approximation of and an estimator of . The (empirical) bias measures the difference between and . For this, we will consider more specifically the shape component of this difference which is called the shape bias. Given these two (rank reduced) covariance matrices, recall that the corresponding shape matrices are defined by and . For an estimator of , all the information about the shape bias is contained in the matrix , or equivalently its condition number (Yohai and Maronna, 1990):
where () is the largest () eigenvalue of the positive-semidefinite matrix . Evaluating the maximum bias of is an empirical matter: for a given sample, it depends on the dimensionality of the data, the rate of contamination by outliers, the distance separating them from the uncontaminated observations, and the spatial configuration of the outliers (). However, because all the algorithms we compare are rotation and shift equivariant, w.l.o.g. we can focus on configurations where is diagonal, and (a -vector of zeros), greatly reducing the number of scenarios we need to consider. Since the effect of contamination is presumably most harmful when the outlier belongs to the subspace spanned by (the orthogonal complement of ) we, concentrate on the class of outlier configurations satisfying these conditions (Maronna, 2005). In the simulation results shown in Section 3.3, the outliers belong to the subspace spanned by the eigenvector corresponding to the -th eigenvalue of (as in Hubert et al. (2005)) whereas the simulation settings shown in the Online Resources the outliers belong to the subspace spanned by all the components of (as is done in Maronna (2005)).
3.2 Outlier configurations
To quantify the robustness of the four algorithms, we generate many contaminated data sets of size with where and are the uncontaminated and contaminated part of the sample. In all simulations, the center of the uncontaminated data its is either or . We show results where , , , and is one of
To facilitate comparison, we consider a generalization to arbitrary values of of the parametrization of used in (Hubert et al., 2005). To define this matrix, , we set the values of the first elements of the diagonal of so that they decrease exponentially and do not drop abruptly before the remaining, smaller, entries. More precisely, the first entries of the diagonal of are the first Fibonacci numbers and the entries are linearly decreasing as (). In Section 2 of the Online Resources, we also provide results using a covariance matrix proprosed by (Maronna, 2005).
Our measure of robustness, the bias, depends on the distance between the outliers and the non outlying observations which we will measure by
where is an indicator for the observations coming from . (A more detailed description of how we set the location of the outliers can be found in Section 3 of the Online Resources.) In the simulations, the distance separating the outliers from the good data is one of
We consider two outlier configurations frequently used in the robust PCA literature (Hubert et al., 2005; Maronna, 2005): (a) Shift outliers: and chosen to satisfy Equation (10); (b) Point-mass outliers: all the outliers are concentrated around a single point at a distance from . To obtain this effect, we set . Both of these outlier configurations are relevant in practical applications where they are, for example, similar to certain types of sensor faults and contamination scenarios.
For FastHCS, the number of initial -subsets is given as in Equation (1), with . The rrcov implementations for ROBPCA and PcaPP include hardcoded values for the number of starting subsets presumed by their authors to be sufficient for these methods. PcaL does not require starting subsets. Section 4 of the Online Resources explains how the reader can use code we supply to replicate all results from this section.
In Figures 1 to 2, we display the bias curves as lattice plots (Deepayan, 2008) for discrete combinations of , and . In all cases, we expect the outlier detection problem to become monotonically harder as we increase and , so little information will be lost by considering a discrete grid of a few values for these parameters. The configurations also depend on the distance separating the data from the outliers. Here, the effects of on the bias are harder to foresee: clearly nearby outliers will be harder to detect but misclassifying distant outliers will increase the bias more. Therefore, we will test the algorithms for many values (and chart the results as a function) of . For each algorithm, a solid colored line will depict the median, and a dotted line (of the same color) the 75th percentile of . Each figure is based on 12000 simulations.
3.3 Simulation results
Figure 1 displays the bias curves corresponding to the fits found by the algorithms for for the shift (right) and point-mass (left) configurations. Regardless of the spatial configuration of the outliers or the value of , the fits found by PcaPP and PcaL generally have high values of . As it turns out, PcaPP and PcaL will show poor performance on all of the remaining simulations as well. Since this poor performance is consistent, we will not discuss it in detail. The performance of ROBPCA is substantially better than the previous two algorithms, but it becomes increasingly unreliable as increases. FastHCS shows almost no bias. Furthermore, we see that in some cases even after the bias curves of ROBPCA have re-descended, they are still above those of FastHCS. Given that this gap increases with , we infer that the eigenvalue estimation of ROBPCA is still influenced by the outliers, even when the eigenvectors are correctly estimated. FastHCS estimates both correctly.
We next consider the high dimensional case of . Figure 2. Across all scenarios, the results are comparable to those in seen in Figure 1. FastHCS is the best performing method, being unaffected by the outliers, while the other methods show high biases on most settings.
Over all of the scenarios, FastHCS shows almost no bias, despite challenging outlier configurations. Furthermore, the bias curves corresponding to the fits found by FastHCS are also less variable: throughout, the 75th percentile of the bias corresponding to the FastHCS fit is typically closer to the median bias than is the case for the other algorithms. These findings indicate that FastHCS meets Criteria (4)-(5) for a robust PCA method. In contrast, we find that the performance of the other methods vary with the configuration of the outliers, the rate of the contamination, and the dimensionality of the -subspace. In Section 5 of the Online Resources, we re-analyze these simulation results, giving similar plots for a measure of outlier misclassification, as well as the principal angle measure and , two measures of quality of fit focusing on the loadings.
An extended simulation study shows that the results we present above are robust the choice of different simulation settings (for example, those used in (Maronna, 2005)), and different bias measurments criterions. However, since the outcome of the extended study is nearly identical to the one we present in this section, we have relegated these results in the Online Resources.
4 Real data applications
We next apply the algorithms to three real data examples. We selected these examples because in each the observations in the data can be separated into two subgroups from which we construct a majority and an outlier group. They are taken from three fields that regularly use PCA: character recognition, chemometrics and genetics. A feature shared by all of these data sets is that the variables within each are measured on the same scale. Data sets with this property were selected to remove the ancillary problem of data standardization. If the variables are not on the same scale, it is common practice in PCA modelling to standardize the data, but the choice of how to do so robustly adds a layer of subjectivity to the results. For the interested reader, the data sets used in this section are included in the FastHCS package. Section 6 of the Online Resources explains how the reader can use code we provide to replicate all results in this section.
The implementations of ROBPCA and PcaPP we use do not have an option to set the seed, but to ensure reproducibility of the results for FastHCS, we set seed=1. PcaL is a deterministic algorithm and uses no seeds. As in the simulations, we run each algorithm with default settings, except for the alpha parameter in ROBPCA which we set to 0.5. To illustrate the outlier detection capabilities of the algorithms, we display diagnostics plots. These show the and values for each observation, divided by the cut-off values in Equations (8) and (9) to put each of the methods on a comparable scale.
4.1 Selecting the number of components
We recommend using as large a value of as possible for FastHCS, since this improves its outlier detection performance. However, to avoid the curse of dimensionality, it is also advised to set (Hubert et al., 2005). In all the examples that follow, we select a relatively high number of components, , to strike a balance between computation time and accuracy. Once the outliers have been detected, components with large eigenvales can be analyzed and used to construct a PCA model of the good data. One may also wish to use a selection criterion, such as the scree chart or contribution to variance. In Section 7 of the Online Resources, we also show results using the latter of these approaches in an extended analysis. In that analysis, the chemometrics data set illustrates how robust PCA methods parametrized based on a parsimonious, eigenvalue-based criterion, may miss outliers on minor components, even when the majority of the data may be modelled using a parsimonious model.
4.2 The Multiple Features data set
The Multiple Features data set (Van Breukelen et al., 1998) contains many replications of hand written numerals (’0’-’9’) extracted from nine original maps of a Dutch public utility. For each numeral, we have 200 replications (the observations) expressed as a vector of 76 of Fourier coefficients (the features) describing its shape. Finally, each numeral has been manually identified, yielding an extra vector of class labels. In this application, we will combine the vectors of Fourier coefficients corresponding to the 200 replications of the digit ’1’ to the vector of Fourier coefficients corresponding to the first 150 replications of the digit ’0’ (so that and ). The goal of the methods will be to distinguish the ’0’s and the ’1’s.
To give an impression of the differences between the two groups, we plot the Fourier coefficients corresponding to the main (outlier) subgroup in the bottom (top) panels of Figure 5 as dark blue (light orange) curves. In general, the curves corresponding to the members of the two groups are visually similar. In particular, the vertical ranges of both largely overlap, and both sets of curves exhibit a similar pattern of variance clustering where the central 40 Fourier coefficients have systematically less dispersion than higher or lower ones.
Figure 8 depicts the four resulting diagnostic plots. We assign to each observation a color (dark blue or light orange) and a plot symbol (round or triangle) depending on whether the corresponding curve describes a member of class ’1’ or ’0’, respectively. The outlier plots of PcaPP and PcaL show that neither method makes any distinction between the two digits, and observations from both groups influence the corresponding PCA models. ROBPCA discovers a different structure in the data, confounding the ’0’s as the majority group and the ’1’s as outliers on the model space. Since only a few ’1’s are outliers, almost all of the observations influence the fitted loadings. In contrast to the other methods, FastHCS correctly identifies all of the ’0’s as outliers and identifies some ’1’s that might warrant additional scrutiny.
4.3 The Tablet data set
The Tablet data (Dyrby et al., 2002) contains the results of an analysis on Escitalopram tablets from the pharmaceutical company H. Lundbeck A/S using near-infrared (NIR) spectroscopy. The study includes tablets of four different dosages from pilot, laboratory and full scale production settings are included. Each tablet (the observations) is measured along 404 wavelengths (the variables). From this data, we extract two subsets of observations which we combine to obtain a new data set formed of two heterogeneous subgroups. Tablets of 80mg will make up the majority group and the rows corresponding to the first 50 tablets with a nominal weight of 250mg will serve as the outliers. This gives a high-dimensional data set (i.e. ) with , and a contamination rate of .
Figure 11, depicts the spectra of the 250mg (light orange) and 80mg (dark blue) tablets. The spectra of the 250mg tablets follow a different multivariate pattern than those of the 80mg tablets. For example, the spectra of the former are generally lower and more spread out than the spectra of the latter. Dyrby et al. (2002) explain that accurate models for NIR analyses of medical tablets are valuable for quality control purposes, since they are fast, nondestructive, noninvasive, and require little preparation. The goal of the algorithms will be to fit a model to the 80mg tablets, despite the presence in the sample of many 250mg tablets.
Figure 14 depicts the diagnostic plots of the scaled outlyingness measures obtained from each of the algorithms. To enhance the distinction between the two groups in our data, we show the 80mg tablets as (dark) blue circles and the 250mg tablets as (light) orange triangles. These results are similar to those we saw when we examined the Multiple Features data set. Again PcaPP and PcaL do not distinguish between the two groups and ROBPCA uses both groups to fit the loadings parameters and confuses the outliers with the majority group on the model space. In contrast, the diagnostic plot derived from the FastHCS fit establishes that the 250mg tablets do not follow the same multivariate patterns as 80mg tablets and, in fact, depart significantly from it. In the plot, we see that FastHCS assigns the outliers high values; excluding them from loadings and eigenvalue estimation. It also assigns many of them high values; revealing their distance on the model space.
4.4 DNA Alteration data set
In our final case study, we examine another high-dimensional data set; the DNA Alteration data set (Christensen et al., 2009). This data set consists of cytosine methylation values collected at 1413 autosomal CpG loci (the variables) in a sample of 217 non-pathological human tissue specimens (the observations) taken from 10 different anatosites. In Christensen et al. (2009), the authors show that the tissue samples in this data set form three well separated subgroups. The first of these constitutes all 113 observations corresponding to cytosine methylation values measured on ”non-blood, non-placenta” (henceforth, simply ”non-blood”) tissues. A second subgroup of data points comprises the 85 cytosine methylation measurements taken on blood tissues.
In this application, we will combine the 113 measurements of cytosine methylation values corresponding to the samples ”non-blood” tissue with 85 measurements taken from blood tissues (so that and ). In Figure 17, we plot the 1413 values corresponding to each blood (light orange) and non-blood (dark blue) observation. Visually, the curves of these two groups appear difficult to distinguish from one another. In particular, the vertical range of both overlap and the groups do not exhibit any pronounced difference in the variability of the variables.
The diagnostic plot for PcaPP (Figure 20) reveals that the fitted model regards blood and non-blood tissue to be quantitatively similar. ROBPCA and PcaL also detect almost none of the outliers, but additionally consider a number of the non-blood observations to be outliers. As in the previous case studies, FastHCS correctly identifies all of the outliers. As a consequence, the parameter estimates corresponding to this model are more likely to reflect the true structure of non-blood tissue than those fitted by the other algorithms.
In this article we introduced FastHCS to satisfy a number of criteria we expect a robust PCA method to have. In both the simulations and real data examples we performed, FastHCS met all of these criteria. In contrast, state-of-the-art methods did not, and often produced results one would expect from a non-robust method. This may seem like an extreme outcome, but it is in fact the very nature of dealing with outliers: if a method fails to identify them, the resulting model fit is often profoundly changed.
It is interesting to compare the performance of FastHCS and ROBPCA because these methods both use variants of projection pursuit. While FastHCS compares the fit produced by the -index to that from the projection pursuit criterion, ROBPCA relies completely on the projection pursuit criterion to construct its initial subset. Thus, the difference in performance between FastHCS and ROBPCA that we observe in our simulations and real data examples arises from the fact that FastHCS nearly always chooses the -index subset over the projection pursuit one.
In most applications, admittedly, data settings and contamination patterns will not be as difficult as those we featured in our simulations and real data examples, and in these easier cases the different methods will, hopefully, concur. Nevertheless, in three real data examples from fields where PCA is widely used, we were able to establish that real world situations can be challenging enough to push current state-of-the-art outlier detection procedures to their limits and beyond, justifying the development of better solutions. In any case, given that in practice we do not know the configuration of the outliers, as data analysts, we prefer to carry out our inferences while planning for the worst contingencies.
The authors wish to acknowledge the helpful comments from two anonymous referees and the editor which improved this paper.
Appendix A Vulnerability of the -index to orthogonal outliers
Throughout this appendix, let be an data matrix of uncontaminated observations drawn from a rank distribution , with and integer satisfying . However, we do not observe but an (potentially) corrupted data matrix that consists of observations from and arbitrary values with denoting the (unknown) rate of contamination. Throughout, and the PCA estimates are defined as in Section (2) with will denoting the -th diagonal entry of .
Equation (11) defines the so-called finite sample explosion breakdown point and Equation (12) the so-called finite sample implosion breakdown point of PCA estimates , and the general finite sample breakdown point is .
The following assumptions (as per, for example Tyler (1994)) all pertain to the original, uncontaminated, data set . We will consider the case whereby the point cloud formed by lies in general position in . The following definition of general position is adapted from Rousseeuw and Leroy (1987):
Definition 1: General position in . is in general position in if no more than -points of lie in any -dimensional affine subspace. For -dimensional data, this means that there are no more than points of on any hyperplane, so that any points of always determine a -simplex with non-zero determinant.
The -index is shift invariant so that, w.l.o.g., we only consider cases where the good observations are centered at the origin. Throughout, we will also assume that the members of are bounded:
for some bounded scalar depending only on the uncontaminated observations and that the uncontaminated observations contain no duplicates:
a.1 Theorem 1: The implosion breakdown, , is
If at least rows of are in general position in , any subset of observations will contain at least observations in general position. This guarantees that the eigenvalue corresponding to any -subset is non-zero (Seber, 2008). Thus, it follows that .
a.2 Finite sample explosion breakdown of
Denote the outlying entries of and . The only outliers capable of causing explosion breakdown must satisfy:
for any bounded scalar and depending only on the uncontaminated observations.
Suppose that the outliers do not satisfy Equation (13) so that , but that the PCA estimates break down. This leads to a contradiction since
Therefore, for a contaminated -subset to cause explosion breakdown, the outliers must satisfy Equation (13).
Assume that an outlier does not satisfy Condition (14). Schmitt et al. (2014) showed that any subset indexing will have an unbounded value of if and only if is unbounded. But for the uncontaminated data, it holds that
so if the contaminated data set contains at least entries from the original data matrix , then it is always possible to construct a subset of entries of for which is bounded so that will never be selected over .
Appendix B The finite sample breakdown point of FastHCS
In this appendix, we derive the finite sample breakdown point of FastHCS. Define , and as in Appendix A. Recall that
where . Then, if or if then the final FastHCS estimates are based on . Otherwise, they are based on .
If and , then .
(Debruyne and Hubert, 2009) showed that the population breakdown point of is 50%, which corresponds to a finite sample breakdown point of . Consequently, will not index any data point for which . Since indexes the overlap between and , if , then .
When is in general position, , and .
We will proceed by showing that the denominators in Equation (17) are bounded, while only the numerator dependent on is bounded.
Lemma 1 implies there exists a fixed constant such that
for any orthogonal matrix . Similarly, since the projection pursuit approach has a breakdown point of , there exists a fixed such that
Next, note that
(Equation (22) follows from Appendix A, Theorem 1), so that
is not bounded from above. Conversely, has an explosion breakdown point of , so that there exists a fixed such that:
From Equations (20) and the unboundedness of (23) it follows that the left-hand side in Equation (17) is unbounded. However, by Equations (20) and (24), the right-hand side of Equation (17) is bounded from above so that in cases where outliers cause explosion breakdown of , criterion (17) will select . Since the breakdown point of is , we have that .
When is in general position, , and , then .
By Appendix A, Theorem 1, we have that the implosion breakdown point of is . The implosion breakdown point of is , which is higher, so it follows that .
For , the finite sample breakdown point of is
The finite sample breakdown point of . Given Lemmas 2 and 3, .
Appendix C Measures of dissimilarity for robust PCA fits.
The objective of the simulation studies in Section 3.3 is to measure how much the fitted PCA parameters obtained by four robust PCA methods deviate from the true when they are exposed to outliers. One way to compare PCA fits is with respect to their eigenvectors, as in the maxsub criterion (Björck and Golub, 1973):
where is the smallest eigenvalue of the matrix . The maxsub has an appealing geometrical interpretation as it represents the maximum angle between a vector in and the vector most parallel to it in . However, it does not exhaustively account for the dissimilarity between two sets of eigenvectors. As an alternative to the maxsub, Krzanowski (1979) proposes the total dissimilarity:
which is an exhaustive measure of dissimilarity for orthogonal matrices. Furthermore, because and (Krzanowski, 1979), it is readily seen that (25) is a measure of sphericity of (it is proportional to the likelihood ratio test statistics for non-sphericity of (Muirhead, 1982, p. 333-335)). However, note that (25) now forfeits the geometric interpretation enjoyed by the maxsub.
In any case, measures of dissimilarity based solely on eigenvectors, such as the maxsub or sumsub, necessarily fail to account for bias in the estimation of the eigenvalues. This is problematic when used to evaluate robust fits because it is possible for outliers to exert substantially more influence on than on . An extreme example is given by the so-called good leverage type of contamination in which the outliers lie on the subspace spanned by so that even the classical PCA estimate (whose eigenvalues can be made arbitrarely bad by such outliers) will have low values of .
In contrast, we are interested in an exhaustive measure of dissimilarity; one that summarizes the the effects of the outliers on all the parameters of the PCA fit into a single number, so that the algorithms can be ranked in terms total dissimilarity. To construct such a measure, it is logical to base it on and its estimate because they contain all the parameters of the fitted model. For our purposes, one need to only consider the effects of outliers on , the shape component of (Hubert et al., 2014). This is because to rank the observations in a contaminated sample in terms of their true outlyingness (and thus reveal the outliers), it is sufficient to estimate the shape component of correctly. Consequently, an exhaustive measure of dissimilarity between and is given by , where is any measure of non-sphericity of its argument. In practice several choices of are possible, the simplest being the condition number of which is defined as the ratio of the largest to the smallest eigenvalue of (Maronna and Yohai, 1995), explaining the definition of .
- Björck and Golub (1973) Björck, Å. and Golub, G. H. (1973). Numerical Methods for Computing Angles Between Linear Subspaces. Mathematics of Computation, 27, 2, 579–594.
- Christensen et al. (2009) Christensen, B.C Houseman, E.A. Marsit, C.J. Zheng, S. Wrench, M.R. Wiemels, J.L. Nelson, H.H. Karagas, M.R. Padbury, J.F. Bueno, R. Sugarbaker, D.J Yeh, R., Wiencke, J.K. Kelsey, K.T. (2009). Aging and Environemental Exposure Alter Tissue-Specific DNA Methylation Dependent upon CpG Island Context. PLoS Genetics 5(8), e1000602.
- Croux and Ruiz-Gazen (2005) Croux, C. and Ruiz-Gazen, A. (2005). High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis, 95, 206–226.
- Donoho (1982) Donoho, D.L. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying Paper Harvard University.
- Debruyne and Hubert (2009) Debruyne, M. and Hubert, M. (2009). The influence function of the Stahel-Donoho covariance estimator of smallest outlyingness. Statistics & probability letters 79(3), 275–282.
- Deepayan (2008) Deepayan, S. (2008). Lattice: Multivariate Data Visualization with R. Springer, New York.
- Dyrby et al. (2002) Dyrby, M. Engelsen, S.B. Nørgaard, L. Bruhn, M. and Lundsberg Nielsen, L. (2002). Chemometric Quantitation of the Active Substance in a Pharmaceutical Tablet Using Near Infrared (NIR) Transmittance and NIR FT Raman Spectra Applied Spectroscopy 56(5): 579–585 .
- Hubert et al. (2005) Hubert, M. Rousseeuw, P. J. and Vanden Branden, K. (2005). ROBPCA: a new approach to robust principal components analysis. Technometrics, 47, 64–79.
- Hubert et al. (2014) Hubert, M., Rousseeuw, P. and Vakili, K. (2014). Shape bias of robust covariance estimators: an empirical study. Statistical Papers, Volume 55, Issue 1, pp 15–28.
- Jensen (1986) Jensen, D. R. (1986), The Structure of Ellipsoidal Distributions, II. Principal Components. Biometrical Journal, 28: 363–369.
- Jolliffe (2002) Jolliffe, I.T. (2002). Principal Component Analysis. Springer, New York. Second Edition.
- Krzanowski (1979) Krzanowski, W.J. (1979). Between-Groups Comparison of Principal Components. Journal of the American Statistical Association, Vol. 74, No. 367, pp. 703–707.
- Li and Chen (1985) Li, G., Chen, Z. (1985). Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. Joural of the American Statistical Association, 80, pp. 759–766.
- Locantore et al. (1999) Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T., and Cohen, K. L. (1999). Robust principal component analysis for functional data. Test. 8(1), 1–73.
- Maronna and Yohai (1995) Maronna R. A. and Yohai V.J. (1995). The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90 (429), 330–341.
- Maronna (2005) Maronna, R. (2005). Principal Components and Orthogonal Regression Based on Robust Scales. Technometrics, 47, 264–273.
- Maronna et al. (2006) Maronna, R. A.; Martin, R. D. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
- Muirhead (1982) Muirhead, R.J. (1982). Aspects of Multivariate Statistical Theory. John Wiley and Sons, New York.
- R Core Team (2012) R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria.
- Rousseeuw and Leroy (1987) Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection. Wiley, New York.
- Schmitt et al. (2014) Schmitt, E. Öllerer, V. and Vakili, K. (2014). The finite sample breakdown point of PCS. Statistics & Probability Letters, 94, 214–220.
- Seber (2008) Seber, G. A. F. (2008). Matrix Handbook for Statisticians. Wiley Series in Probability and Statistics. Wiley, New York.
- Stahel (1981) Stahel W. (1981). Breakdown of Covariance Estimators. Research Report 31, Fachgrupp für Statistik, E.T.H. Zürich.
- Todorov and Filzmoser (2009) Todorov V. and Filzmoser P. (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32, 1–47.
- Tyler (1994) Tyler, D.E. (1994). Finite Sample Breakdown Points of Projection Based Multivariate Location and Scatter Statistics.
- Vakili and Schmitt (2014) Vakili, K. and Schmitt, E. (2014). Finding multivariate outliers with FastPCS. Computational Statistics & Data Analysis, Vol. 69, 54–66.
- Van Breukelen et al. (1998) Van Breukelen, M. Duin, R.P.W. Tax, D.M.J. and Den Hartog, J.E. (1998). Handwritten digit recognition by combined classifiers. Kybernetika, 34, 381–386.
- Wu et al. (1997) Wu, W., Massart, D. L., and de Jong, S. (1997), The Kernel PCA Algorithms forWide Data. Part I: Theory and Algorithms. Chemometrics and Intelligent Laboratory Systems, 36, 165–172.
- Yohai and Maronna (1990) Yohai, V.J. and Maronna, R.A. (1990). The Maximum Bias of Robust Covariances. Communications in Statistics–Theory and Methods, 19, 2925–2933.