Outlier detection and trimmed estimation for general functional data
This article introduces trimmed estimators for the mean and covariance function of general functional data. The estimators are based on a new measure of outlyingness or data depth that is well defined on any metric space, although this paper focuses on Euclidean spaces. We compute the breakdown point of the estimators and show that the optimal breakdown point is attainable for the appropriate choice of tuning parameters. The small-sample behavior of the estimators is studied by simulation, and we show that they have better outlier-resistance properties than alternative estimators. This is confirmed by two real-data applications, that also show that the outlyingness measure can be used as a graphical outlier-detection tool in functional spaces where visual screening of the data is difficult.
Breakdown Point; Data Depth; Robust Statistics; Stochastic Processes.
Many statistical applications today involve data that does not fit into the classical univariate or multivariate frameworks; for example, growth curves, spectral curves, and time-dependent gene expression profiles. These are samples of functions, rather than numbers or vectors. We can think of them as realizations of a stochastic process with sample paths in , the space of square-integrable functions. The statistical analysis of function-valued data has received a lot of attention in recent years (see e.g. Ramsay and Silverman 2002, 2005, and references therein). However, most of the work on Functional Data Analysis has focused on univariate curves; in many applications, the sample functions are not univariate.
Consider, for example, excitation-emission matrices (EEMs), which are common in Chemometrics. When certain fluorescent substances are exposed to light of wavelength , they emit light at wavelength . The resulting light intensity is then a bivariate function , that is, a function. Mortensen and Bro (2006) analyzed a collection of 338 such surfaces; the logarithms of four of them are shown in Figure ?. A movie showing the 338 surfaces in quick succession is available on the author’s website; it is clear in this movie that there are some atypical surfaces in the data set. For example, Figure ? shows that even after taking logarithms, the surface on the lower right corner is out of line compared to the other three. These atypical objects is what we will refer to as outliers in this paper; that is, objects that depart from the main modes of variability of the majority of the data. Note that since each functional object typically consists of many measurements taken at different time points (or wavelength points, in this case), a few of those measurements could be outlying without the whole surface being necessarily atypical. But that kind of isolated measurement errors are not the type of outliers we are interested in in this paper; they have been addressed in the robust smoothing literature (e.g. Shi and Li 1995, Jiang and Mack 2001.)
As a second example, consider a digit recognition problem. Eight handwritten fives, from a total of 1055 samples, are shown in Figure ?. The planar trajectory of the pen tip is a curve in , where the variable is time, so the digits are functions (note that we are ignoring a third variable, , the distance between the pen tip and the writing pad). In Figure ? we see that some of the handwritten digits look more like sixes than fives. The reason for this is going to be explained in Section ?, but it is clear at this point that the sample of fives is not homogeneous; it contains either isolated outliers or systematic clusters that are hard to guess a priori.
These examples show two things, which are the main motivation for this paper: (i) functional data belonging to spaces more complicated than are encountered in practice, and (ii) outliers may be present in a sample but, due to the complexity of the data, visual screening of the data set may be impractical or impossible. The problem of robust estimation in functional spaces has been addressed by some authors, including Locantore et al. (1999), Fraiman and Muniz (2001), Cuevas et al. (2007), Gervini (2008), and López-Pintado and Romo (2009). But all of these papers deal with univariate curves. Some of these methods can be extended to more complex spaces in a more or less straightforward way, but some of them cannot. For example, the methods of Fraiman and Muniz (2001) and López-Pintado and Romo (2009) are based on data-depth notions that require an ordering of the response variables and then they cannot be extended to vector-valued functions like the handwritten digits in an obvious way. On the other hand, the projection-based methods discussed in Cuevas et al. (2007) and the spatial median and the spherical principal components of Locantore et al. (1999) and Gervini (2008) can be extended to any Euclidean space; but Gervini (2008) found that the breakdown point of the spherical principal components is very low, so a third goal of this paper is to develope principal component estimators that are more robust than the spherical principal components but not as computationally demanding as the projection-based methods of Cuevas et al. (2007).
The estimators introduced in this article are based on a measure of outlyingness that can be defined on any metric space, but we will restrict ourselves to Euclidean spaces, where principal components can also be defined. These estimators are easy to compute and turned out to have very good robustness properties. We prove in Section ? that they can attain the optimal 50% breakdown point (i.e. they can resist up to 50% of outliers in the data). In our simulation study (Section ?) they outperformed most of the alternative estimators cited above. The paper also studies other theoretical properties in Section ?, and analyzes in more detail the two applications mentioned above (Section ?). Proofs of the theoretical results and an additional real-data application can be found in a technical supplement available on the author’s webpage.
2Trimmed estimators based on interdistances
2.1A measure of outlyingness
Let be a sample in a Euclidean space , i.e. a linear space endowed with an inner product (for instance, with its canonical inner product .) The inner product induces the norm in , and this norm induces the distance function , so any Euclidean space is a metric space. Let us consider the set of interdistances . An observation can be seen as an outlier if it’s far from most of the other observations (not necessarily from all of them, because outliers sometimes form clusters). Given , we define the -radius as the distance between and the -th closest observation, where denotes the integer closest to from above. This is the radius of the smallest ball centered at that covers of the observations. Intuitively, will be small where the data is dense and large where the data is sparse (see Proposition prop:r_depth in Section ?). Therefore, the rank of in the set will be a measure of the outlyingness of : the more isolated is, the larger will be compared to the other radii.
In principle the coverage parameter could be any number between 0 and 1, but note that if there is a tight cluster of outliers and , then the radii of the outliers will be small, perhaps even smaller than the radii of the good observations, which would render them useless for our purposes. Therefore must be large enough that at least one good observation is captured by whenever is an outlier. Since can be as large as , in general only will guarantee this. On the other hand, taking may cause the opposite problem: that an outlying observation will always be captured by when is not an outlier, making the radii of the good observations too large (the formalization of these heuristics constitute the proof of Proposition prop:BDP in Section ?). For these reasons we will always take for estimation purposes. However, for outlier-screening purposes it is instructive to see boxplots and histograms of the radii for values of less than ; the outliers tend to emerge clearly and consistently as increases.
At this point some comments about the actual computation of the interdistances are in order. First, note that all the interdistances can be computed from the set of inner products , since . It is easy to compute the pairwise inner products when the sample objects have been pre-smoothed, or even if they have not been pre-smoothed but they were sampled on a regular common grid without much random error. In that case, a basic numerical integration method such as the trapezoidal rule will give accurate results (see Gervini 2008, Theorem 1). But if the s were sampled on sparse and irregular grids, perhaps with a different grid for each individual, then it will not be possible to estimate all pairwise inner products and this method cannot be applied (the other methods mentioned in the introduction cannot be applied either, since they are based on pre-smoothed data).
In addition to being useful outlying-screening tools, the radii can be used to construct robust estimators of the mean, the covariance function and the principal components of the process under consideration. For a stochastic process in with , the mean operator and the covariance operator are defined as follows: is given by , and is given by (these quantities are well defined because and are real-valued random variables with finite variances for any and in .) By Riesz Representation Theorem there exists a unique such that , which we call (this is one way to define the expectation of a stochastic process in a Euclidean space.)
In a Euclidean space it is also possible to define principal directions of variability, or principal components. The first principal component of is that maximizes among with ; the second principal component is that maximizes among with and ; and so on. It can be shown (Gohberg et al. 2003, chap. IV) that the principal components are eigenfunctions of the covariance operator and they are countable; that is, with and .
The classical estimators of these quantities (the sample mean, covariance, and principal components) are not resistant to outliers. As a more robust alternative we propose trimmed estimators based on the radii. Specifically, given a trimming proportion we define and
These are hard-trimmed estimators, where a 0-1 weight function is used. More generally, we can define weights of the form , where is a bounded, non-negative and non-increasing function such that for and for . Soft-trimming weights are obtained with a smooth function such as
where for some , and . This function downweights the largest radii, and cuts off the largest radii completely; we can take, for example, and .
Trimmed estimators based on various measures of data depth have been proposed in other contexts, in particular in multivariate analysis (Fraiman and Meloche 1999, Liu et al. 1999, Serfling 2006, Zuo and Serfling 2000, Zuo et al. 2004). The behavior of these estimators varies according to the specific data-depth measure that is being used, but as a general rule, their outlier resistance increases as increases and their efficiency decreases as increases (see e.g. Stigler 1973; Van der Vaart 1998, chap. 22; Maronna et al. 2006, chap. 2). Since there is a trade-off between robustness and efficiency, we recommend choosing in a data-driven way: a histogram of the radii usually gives a good idea of the proportion of outliers in the sample, and this value could be used as . A more objective alternative, suggested by a referee, is to fit a mixture of two Gamma distributions to the sample of radii and take as the proportion of observations in the smaller group. If instead of these data-driven choices of the user prefers to use a fixed , our simulations showed that soft-trimming weights like (Equation 3) are preferrable to hard-trimming weights (see Section ?).
Just like the radii (and therefore the weights ) depend on the data only through the inner products as mentioned in the previous section, the principal components of ( eq:C_hat_gen) can also be computed entirely from the inner products, as explained in Gervini (2008) and Jolliffe (2002, ch. 3.5): if , then and , where is the th unit-norm eigenvector of the matrix with elements , and is the th eigenvalue (the s can be expressed entirely in terms of the s and the s, after some algebra). The applicability of these estimators will then be limited only by the possibility of computing all pairwise inner products. As mentioned before, this is generally not possible if the data objects were sparsely and irregularily sampled, and alternative estimation methods must be sought. For instance, the reduced-rank -model estimators of Gervini (2010), which were originally developed for sparsely sampled univariate curves, can be extended to more general functional spaces, but this is clearly outside the scope of this paper.
3Properties of the estimators
Location and scatter estimators must satisfy certain equivariance properties, in order to be proper measures of location and scatter . A location estimator must be translation equivariant: if is the estimator based on the sample , then the estimator based on the sample , with , must be . Other desirable properties are scale and rotation equivariance: if is the estimator based on the sample , then the estimator based on the sample , with a unitary operator and , must be (a unitary operator is such that for every .) A scatter estimator, on the other hand, must be translation invariant (i.e. remain unchanged under translations) and rotation and scale equivariant in the following sense: if is the covariance estimator based on the sample , then the covariance estimator based on the sample must be , where is the adjoint of (i.e. the unique operator that satisfies for every and in .) The rotation equivariance of automatically implies rotation equivariance of the principal component estimators obtained from .
Our trimmed estimators satisfy these properties, as shown in Proposition prop:Equivariance. This is a consequence of the translation and rotation invariance of the radii, and therefore of the weights (which, in addition, are scale invariant). Note that translation, scale and rotation invariance are properties that any outlyingness measure should satisfy: if an observation is considered an outlier for a given dataset, the same observation should still be considered an outlier if the dataset is simply translated, rotated or re-scaled.
The robustness of an estimator is usually measured by the breakdown point (Donoho and Huber 1983). The finite-sample breakdown point is the largest proportion of outliers that an estimator can tolerate. More rigorously: given a sample , let be a contaminated sample obtained from by changing points arbitrarily; then the finite-sample breakdown point of is , where is the smallest for which there is a sequence of contaminated samples such that . The finite-sample breakdown point of is defined analogously. The asymptotic breakdown point is the limit of as goes to infinity, if the limit exists. The highest asymptotic breakdown point attainable by an equivariant estimator is .50 (Lopuhaä and Rousseeuw 1991).
This proposition shows that the asymptotic breakdown point of the trimmed estimators is . Then, if , the breakdown point is just the trimming proportion , and the optimal breakdown point can be attained with . In practice, though, such estimators are very inefficient when the actual proportion of outliers is much less than 50%, as we will show by simulation in Section sec:Simulations. A better alternative is to use soft trimming, as explained in Section sec:Definition.
3.2Population versions and properties
The estimators (Equation 1) and (Equation 2) can be generalized to any probability measure on , of which ( eq:mu_hat_gen) and (Equation 2) can be seen as particular cases obtained for , the empirical measure on the sample . One of the reasons this generalization is useful is that it allows us to study the consistency of the estimators: since when the s are i.i.d. with distribution , under certain conditions (Fernholz 1983; Van der Vaart 1998, ch. 20) and will converge in probability to their respective population versions and .
The derivation of and is as follows. Let be a stochastic process with distribution . Define for each . The radius of the smallest ball centered at with probability is , where is the usual quantile function. Then is the -radius around , and if , the weight function has the form , with as in Section ?. Then
The eigenvalues and eigenfunctions of will be denoted by and , respectively.
The following proposition shows that and are well-defined for any probability distribution on , even if does not have finite moments of any order.
The next proposition shows that is really a measure of outlyingness, in the sense that is larger in regions of where is less concentrated.
The equivariance of and carries over to and (the proof is given in the technical supplement). A consequence of the translation equivariance of is the following:
To study the population versions of the trimmed principal components let us assume that admits, with probability 1, the decomposition
where , the s are real random variables, is an orthonormal system, and is a strictly positive non-increasing sequence with ; the set of indices is countable but may be finite or infinite. This decomposition holds, for instance, if , and is known as the Karhunen–Lo ève decomposition (Ash and Gardner 1975, ch. 1.4). In that case , the s and the s are the eigenfunctions and eigenvalues of the covariance operator, and are uncorrelated with and .
But expansion (Equation 4) also holds in some situations where , providing a meaningful notion of heavy-tailed distributions for functional spaces. For instance, if the s in ( eq:K-L_decomposition) are independent, Kolmogorov’s Three Series Theorem (Gikhman and Skorokhod 2004, p. 384) implies that converges almost surely in if and only if for every . The latter is satisfied whenever the s go to zero fast enough, even if the s do not have finite moments of any order. For example, if the s have a Cauchy distribution,
and the right-hand side is finite for any as long as .
Model (Equation 4) is also useful to characterize the two types of outliers that may be present in a functional sample. One type of outliers would be observations that satisfy model (Equation 4 ) but with extreme values of the s, which we call intrinsic outliers, since they belong to the space generated by the s. Another type of outliers would be those that do not follow model ( eq:K-L_decomposition) at all, which we denominate extrinsic outliers, since they fall outside the subspace of where the good data lives. To exemplify the difference: suppose that a sample of curves shows a prominent feature, such as a peak, and the leading principal component explains variability around this peak (a usual situation). An intrinsic outlier would be a curve with an unusual peak (either too flat or too sharp compared to the other curves), whereas an extrinsic outlier would be an observation with a peak at a different location, where the rest of the data shows no such feature. Our estimators can handle both types of outliers, since the interdistances make no distinction between the two types (although extrinsic outliers are easier to spot). The outliers considered in the simulations (Section ?) are intrinsic outliers, while those in the examples (Section ?) are mostly extrinsic outliers.
Note that under model (Equation 4) the interdistances satisfy , so the distribution of the s (and therefore of the radii) depends entirely on the s and the s, not on or the s. This implies the following:
This result implies that the set of principal components of , , coincides with the set , but it cannot be said in general that for each , because the sequence is not necessarily decreasing. The reason is that although is stochastically greater than when and the s are identically distributed, this does not imply that is stochastically greater than in general, so it cannot be guaranteed that . However, ( eq:lambda_tilde) does imply that if and only if , so the dimension of the model is preserved.
We ran a Monte Carlo study to assess the comparative performance of the following estimators:
The sample mean and sample principal components.
The spatial median and spherical principal components (Locantore et al. 1999, Gervini 2008). The spatial median is defined as the that minimizes , and the spherical principal components are defined as the principal components of the normalized sample , i.e. the eigenfunctions of the covariance operator
Trimmed estimators based on the deviations , where is the spatial median, with 20% and 50% trimming. The observations with the largest 20% or 50% deviations where eliminated and the mean and principal components of the remaining data was computed.
Trimmed estimators based on -depth (Cuevas et al. 2007), with 20% and 50% trimming. The -depth of a datum is defined as
where for some kernel function . Following Cuevas et al. (2007), we take as the Gaussian density and as the 20th percentile of the set of -interdistances (there is no clear rationale for this choice but we used the same tuning parameters as Cuevas et al. in order to make our simulation results comparable to theirs). Note that a small, not a large, value of would indicate that is an outlier, so we trim those observations with small value of . In the extensive simulations run by Cuevas et al., these estimators outperformed the estimators of Fraiman and Muniz (2001) and some projection-based estimators, so we did not include the latter in our simulations.
Trimmed estimators based on band depth (López-Pintado and Romo 2009), with 20% and 50% trimming. The band depth is computed as follows. Given real-valued functions defined on some interval , with , the -band spanned by these functions is
For a given curve and a sample , let be the average number of sample -bands that contain the graph of ; that is,
where . The -depth of the curve is defined as . As recommended by López-Pintado and Romo (2009), we use . Once again, outliers are indicated by small values of , so we trim the observations with smallest values of .
The trimmed estimators introduced in this article, with hard and soft rejection weights. For hard-rejection weights, 20% and 50% fixed trimming was considered as well as the adaptive estimated with Gamma mixtures; for the soft-rejection weight (Equation 3), the parameters and were used. In all cases, the radii were computed with (simulations with were also run but not reported here, because the estimator’s performance was uniformly worse than for ).
The data was generated following model (Equation 4) with and , for . The s followed different distributions for each scenario, as explained below. Two sequences of eigenvalues were considered: a slow-decaying sequence (Model 1), and a fast-decaying sequence (Model 2); note that in both cases. Model 2 is practically a finite-dimensional model, since the first five terms accumulate 97% of the variability; Model 1, on the other hand, needs 31 terms to accumulate the same proportion of the variability, so it is an infinite-dimensional model for practical purposes. For actual data generation we truncated Model 1 at the 1000th term and Model 2 at the 10th term, which represent 99.9% of the total variability in both cases. The sample size was in all cases, and the curves were discretized at an equally spaced grid of 100 points. Each sampling situation was replicated 2000 times; the mean absolute errors reported in Tables ? and tab:Simulations_PC_K3 are accurate up to two significant places (we do not report Monte Carlo standard errors for reasons of space and readability).
Regarding the distribution of the s, we were interested in three situations: (i) non-contaminated Normal data, (ii) outlier-contaminated Normal data, and (iii) non-Normal data (specifically, data with heavier tails than Normal). For case (i) we generated i.i.d.s with distribution. For case (ii) we considered two scenarios: to study the robustness of the location estimators, we generated outliers by adding to sample curves (which creates a bias in ); to study the robustness of the estimators of , we generated outliers by adding to sample curves and subtracting the same quantity from other sample curves (this contamination inflates the variability in the direction of , creating a bias in without affecting ). Four values of were considered: , , , and . For case (iii) we generated i.i.d. s with Student’s distribution, with degrees of freedom equal to 1, 2, and 3.
Table ? reports the mean absolute error for each estimator and each sampling distribution. Since there are 12 estimators and 8 sampling distributions it is hard to make conclusions at a glance. To facilitate comparisons, we ranked the estimators in increasing order of error for each sampling distribution, and computed the average rank for each estimator; this average rank is given in the last column of Table ?. We see that the comparative performance of the estimators is similar under both models. The soft-trimmed estimators show the best overall performance, since they have the smallest average ranks; in the other extreme, the band-depth 50%-trimmed estimators show the worst overall performance. Looking into the numbers in more detail, we see that hard-trimmed estimators with 20% trimming perform poorly for contaminated Normal distributions with and for the Cauchy distribution. Among hard-trimmed estimators with 50% trimming, our estimator and the -depth-based estimator are comparable, the former being better for contaminated Normals with (and significantly better for ) and the latter being slightly better in the other situations. The soft-trimmed estimator shows an intermediate behavior between the 20% and 50% hard-trimmed estimators; even at the most extreme cases of the 40% contaminated Normal and the Cauchy distribution, its estimation error is not much larger than that of the 50% hard-trimmed estimator. The adaptive estimator also shows an intermediate behavior between the 20% and 50% hard-trimmed estimators, except for the Cauchy distribution, for which it does not even seem to be well defined (the same can be said for the estimators based on band depth); this is not entirely surprising, since the Cauchy distribution produces a single heavy-tailed distribution of radii rather than a mixture. All things considered, the soft-trimmed estimator seems to offer the best trade-off between robustness and efficiency.
For the principal component estimators, the mean absolute errors are reported in Table tab:Simulations_PC_K3. Breakdown of occurs when is orthogonal to , in which case , so the errors are always bounded. The best-ranked estimators are now the spherical principal components, which is rather unexpected, but looking into the numbers in more detail, we see that this is mostly because of their low errors for distributions. Their performance for contaminated Normal distributions is not good, showing very large errors for contamination proportions as small as 20%. Our hard-trimmed estimators and the -depth-based estimators show comparable performances, although once again the soft-trimmed estimator offers a better trade-off between robustness and efficiency: although it breaks down for the 40%-contaminated Normal, it has much lower estimation errors than the 50%-hard-trimmed estimators for lower levels of contamination and for distributions (even for the Cauchy). The adaptive estimator does no break down for the 40%-contaminated Normal, but it does for the Cauchy distribution.
The similar behavior of the estimators based on the radii and those based on the -depth is not accidental, because both are based on metric notions of data depth: the -radius measures the distance between a given datum and its closest th neighbor, while the -depth essentially counts the number of observations within a fixed distance of a given datum; so these measures are, in a way, the dual of one another. However, the -radii have certain advantages over the -depth: the parameter that defines the radii is an interpretable quantity, while the parameter that defines the -depth is an arbitrary bandwidth with an unknown effect on the estimator’s properties. Also, the breakdown point of the estimators based on -radii is known, while the breakdown point of the estimators based on -depth is unknown.
In contrast with these metric notions of data depth, the band depth of Ló pez-Pintado and Romo (2009) is not based on distances but on the number of bands that cover each sample function. Therefore the trimmed estimators based on band depth behave very differently (in fact, much worse) than those based on -radii or -depth. Two additional disadvantages of band-depth trimming are that the determination of all the bands that cover a given curve is a combinatorial problem, which is unfeasible for large sample sizes, and that generalizing the concept of band to Euclidean spaces beyond univariate curves is not obvious.
As explained in Mortensen and Bro (2006), enzyme cultivation processes often require quick on-line adjustments that demand fast and reliable quality control tools. Samples of the cultivation broth are typically taken at regular time intervals, and enzyme activity is measured. The traditional off-line chemical analyses determine enzyme activity directly and accurately, but it may take hours or days to get the results back from the laboratory. An alternative is to employ multi-channel fluorescence sensors that produce immediate results in the form of excitation-emission matrices (EEMs), although enzyme activity can be determined only indirectly from the EEMs (via principal component regression or partial least squares).
Mortensen and Bro (2006) analyze a dataset of 338 EEMs, available at http://www.models.life.ku.dk/research/data/. A movie showing these 338 EEMs in quick succession is available on the author’s website. A few atypical EEMs can be spotted at the end of the movie. Taking logarithms of the EEMs ameliorates the effect of the outliers to some extent, but not completely, as Figure ? shows (a movie showing the log-EEMs is also available on the author’s website).
In principle, an EEM is a two-dimensional array consisting of light intensity measured at certain excitation and emission wavelengths. Mortensen and Bro (2006) use 15 excitation filters ranging from 270 to 550 nm, and 15 emission filters ranging from 310 to 590 nm; all filters have a maximum half-width of 20 nm. Since emission wavelength must be longer than excitation wavelength, the EEMs are actually triangular arrays: of the possible excitation/emission combinations, only 120 yield actual measurements. This problem could be approached as a classical multivariate problem of dimension and sample size , and some of the robust methods reviewed by Filzmoser et al. (2009) for the large , small problem could be applied. However, since light intensity is a continuous function of the excitation and emission wavelengths, it is more appropriate and statistically efficient to approach this problem as a functional-data problem; the 120 measurements are just an arbitrary discretization of the continuous surfaces , which live in . This is a Euclidean space with inner product , the mean of is the bivariate function and the covariance operator of can be represented as
We carried out a principal component analysis on the log-EEMs. A histogram of the radii (Figure ?) shows that 15 observations are clear outliers, so we computed the 5% trimmed mean and principal components (Figure ?). Among the 20 leading components, the first one (Figure ?(b)) accounts for 59% of the variability, and the second one (Figure ?(c)) accounts for 20% of the variability. We also computed the sample mean and principal components; among the 20 leading components, the first one (Figure ? (e)) accounts for 88% of the variability, and the second one (Figure fig:EEM_mean_pcs(f)) for only 5%.
While the sample mean (Figure ?(d)) is not very different from the trimmed mean (Figure ?(a)), the first principal component is seriously affected by the outliers. The first sample principal component only explains how the outliers vary from the good observations; it may be useful for outlier detection, but it’s not associated with any genuine source of variability. The first trimmed component, in contrast, is genuinely the main direction of variability of the clean data.
The second trimmed component and the second sample principal component are very similar, but the latter underestimates the relative importance of the component, assigning it only 5% of the total variability. This is bad because the second component is the one primarily associated with enzyme activity. Mortensen and Bro (2006) provide an enzyme activity measure for calibration, and the correlation coefficient (after eliminating the 15 outliers) between enzyme activity and the second trimmed component is .69. This association could be overlooked if the user based his analysis on the non-robust sample principal components and decided that the second component was negligible.
The planar trajectory of a pen tip is a curve in , where is time. Then the analysis of handwritten digits can be approached as a functional data problem in the Euclidean space endowed with the inner product . The mean trajectory is and the covariance operator can be represented as
with . In this section we analyze a set of 1055 handwritten samples of the digit five, available at the Machine Learning Repository of the University of California at Irvine, http://archive.ics.uci.edu/ml/. The data was rotated and scaled so that and range between 0 and 100, and between 0 and 1. Eight sample digits are shown in Figure ?.
A plot of the sample mean (Figure ?(a)) does not resemble a five or any other recognizable digit. To understand why this happens, we computed the radii for different values of and noticed that their distribution becomes increasingly bimodal as increases. The histogram for is shown in Figure ?. There are two neatly distinguishable groups: 627 observations with , and 428 observations with . The large number of observations in the second group (40.5% of the data) suggests that the sample may be made up of two systematic clusters, rather than a single homogeneous group and a few isolated outliers.
This is confirmed by a plot of the 41% trimmed mean (Figure fig:digit_means(b)), together with the mean of the observations that were cut off (Figure ?(c)). It turns out that there are two ways to draw the number five. The most common way is in two strokes, beginning at the upper left corner and moving downwards, then raising the pen to draw the top dash (but our planar representation of the trajectory does not capture this vertical movement explicitly). The other way, less common, is to draw the number five in a single stroke, like the letter S. Figure ? (b) corresponds to the first class and Figure ?(c) corresponds to the second one.
As in the EEMs example, the sample principal components do not provide much useful information except for discrimination. The trimmed principal components, on the other hand, do provide useful information about the directions of variability in the bigger cluster. The easiest way to interpret the principal components is to plot their effects on the mean (Figure ?). This figure shows the trimmed mean and the first two trimmed principal components (Figure ? (a,b)), as well as the mean and the first two principal components of the observations that were cut off (Figure ?(c,d)). The first trimmed principal component (Figure ?(a)) explains 56% of the variability and is associated with variation in the inclination of the belly of the digit. The second trimmed principal component (Figure ?(b)) explains 14% of the variability and is mostly associated with variation in the inclination of the vertical dash. Regarding the components of the second type of fives, the first principal component (Figure ?(c)) accounts for 44% of the variability and is associated with variation in the roundness of the five : negative scores correspond to rounded S-shaped digits, while positive scores correspond to more angular Z-shaped digits. The second principal component (Figure ?(d)) accounts for 20% of the variability and explains variability in the width of the digit.
This research was supported by the National Science Foundation, grant DMS 0604396.