Optimal Bayes Classifiers for Functional Data and Density Ratios

July 9, 2019

Short title: Functional Bayes Classifiers

Xiongtao Dai (corresponding author)


Department of Statistics

University of California, Davis

Davis, CA 95616 U.S.A.

Email: dai@ucdavis.edu


Hans-Georg Müller (research supported by NSF grants DMS-1228369 and DMS-1407852)


Department of Statistics

University of California, Davis

Davis, CA 95616 U.S.A.

Email: hgmueller@ucdavis.edu


Fang Yao (research supported by the Natural Sciences and Engineering Research Council of Canada)

Department of Statistics

University of Toronto

100 St. George Street

Toronto, ON M5S 3G3

Canada

Email: fyao@utstat.toronto.edu

ABSTRACT

Bayes classifiers for functional data pose a challenge. This is because probability density functions do not exist for functional data. As a consequence, the classical Bayes classifier using density quotients needs to be modified. We propose to use density ratios of projections on a sequence of eigenfunctions that are common to the groups to be classified. The density ratios can then be factored into density ratios of individual functional principal components whence the classification problem is reduced to a sequence of nonparametric one-dimensional density estimates. This is an extension to functional data of some of the very earliest nonparametric Bayes classifiers that were based on simple density ratios in the one-dimensional case. By means of the factorization of the density quotients the curse of dimensionality that would otherwise severely affect Bayes classifiers for functional data can be avoided. We demonstrate that in the case of Gaussian functional data, the proposed functional Bayes classifier reduces to a functional version of the classical quadratic discriminant. A study of the asymptotic behavior of the proposed classifiers in the large sample limit shows that under certain conditions the misclassification rate converges to zero, a phenomenon that has been referred to as “perfect classification”. The proposed classifiers also perform favorably in finite sample applications, as we demonstrate in comparisons with other functional classifiers in simulations and various data applications, including wine spectral data, functional magnetic resonance imaging (fMRI) data for attention deficit hyperactivity disorder (ADHD) patients, and yeast gene expression data.

Key words and phrases:  Common functional principal component, density estimation, functional classification, Gaussian process, quadratic discriminant analysis.

AMS Subject Classification: 62M20, 62C10, 62H30

1 Introduction

In classification of functional data, predictors may be viewed as random trajectories and responses are indicators for two or more categories. The goal of functional classification is to assign a group label to each predictor function, i.e., to predict the group label for each of the observed random curves. Functional classification is a rich topic with broad applications in many areas of commerce, medicine and the sciences, and with important applications in pattern recognition, chemometrics and genetics (Song et al. 2008; Zhu et al. 2010, 2012; Francisco-Fernández et al. 2012; Coffey et al. 2014). Within the functional data analysis (FDA) framework (Ramsay and Silverman 2005), each observation is viewed as a smooth random curve on a compact domain. Functional classification also has been recently extended to the related task of classifying longitudinal data (Wu and Liu 2013; Wang and Qu 2014) and has close connections with functional clustering (Chiou and Li 2008).

There is a rich body of papers on functional classification, using a vast array of methods, for example distance-based classifiers (Ferraty and Vieu 2003; Alonso et al. 2012), $k$-nearest neighbor classifiers (Biau et al. 2005; Cérou and Guyader 2006; Biau et al. 2010), Bayesian methods (Wang et al. 2007; Yang et al. 2014), logistic regression (Araki et al. 2009), and partial least squares (Preda and Saporta 2005; Preda et al. 2007).

It is well known that Bayes classifiers based on density quotients are optimal classifiers in the sense of minimizing misclassification rates (see for example Bickel and Doksum 2000). In the one-dimensional case, this provided one of the core motivations for nonparametric density estimation (Fix and Hodges Jr 1951; Rosenblatt 1956; Parzen 1962; Wegman 1972). For higher-dimensional cases, however, an unrestricted nonparametric approach is subject to the curse of dimensionality (Scott 2015), which leads to very slow rates of convergence for estimating the nonparametric densities for dimensions larger than three or four and renders the resulting classifiers practically worthless. The situation is exacerbated in the case of functional predictors, which are infinite-dimensional and therefore associated with a most severe curse of dimensionality. This curse of dimensionality is caused by the small ball problem in function space, meaning that the expected number of functions falling into balls with small radius is vanishingly small, which implies that densities do not even exist in most cases (Li and Linde 1999; Delaigle and Hall 2010).

As a consequence, in order to define a Bayes classifier through density quotients with reasonably good estimation properties, one needs to invoke sensible restrictions. These could be for example restrictions of the class of predictor processes, an approach that has been adopted in Delaigle and Hall (2012), who consider two Gaussian populations with equal covariance and use a functional linear discriminant, the functional analogue of the linear discriminant that is the Bayes classifier in the corresponding multivariate case. Galeano et al. (2015) propose a functional quadratic method for discriminating two general Gaussian populations, making use of a suitably defined Mahalanobis distance for functional data.

In contrast to these approaches, we aim here at the construction of a nonparametric Bayes classifier for functional data. To achieve this, we project the observations onto an orthonormal basis that is common to the two populations, and construct density ratios through products of the density ratios of the projection scores. The densities themselves are nonparametrically estimated, which is feasible as they are one-dimensional. We also provide an alternative implementation of the proposed nonparametric functional Bayes classifier through nonparametric regression. This second implementation of functional Bayes classifiers sometimes works even better than the direct approach through density quotients in finite sample applications.

We obtain conditions for the asymptotic equivalence of the proposed functional nonparametric Bayes classifiers and their estimated versions, and also for asymptotic perfect classification when using our classifiers. The term “perfect classification” was introduced in Delaigle and Hall (2012) to denote conditions where the misclassification rate converges to zero, as the sample size increases, and we use it in the same sense here. Perfect classification in the Gaussian case requires that there are certain differences between the mean or covariance functions, while such differences are not a prerequisite for the nonparametric approach to succeed. In the special case of Gaussian functional predictors, the proposed classifiers simplify to those considered in Delaigle and Hall (2013). Additionally, we extend our theoretical results to cover the practically important situation where the functional data are not fully observed, but rather are observed as noisy measurements that are made on a dense grid.

We introduce the proposed Bayes classifiers in section 2 and discuss their estimation in section 3. We do not require knowledge about the type of underlying processes that generate the functional data. One difficulty for the theoretical analysis, addressed in section 4, is that the projection scores themselves are not available but rather have to be estimated from the data. Practical implementation, simulation results and applications to various functional data examples are discussed in subsection 5.1, subsection 5.2 and subsection 5.3, respectively. We demonstrate that the finite sample performance of the proposed classifiers in simulation studies and also for three data sets is excellent, specifically in comparison to the functional linear (Delaigle and Hall 2012), functional quadratic (Galeano et al. 2015), and functional logistic regression methods (James 2002; Müller and Stadtmüller 2005; Leng and Müller 2006; Escabias et al. 2007).

2 Functional Bayes Classifiers

We consider the situation where the observed data come from a common distribution $(X, Y)$, where $X$ is a fully observed square integrable random function in $L^2(\mathcal{T})$, $\mathcal{T}$ is a compact interval, and $Y \in \{0, 1\}$ is a group label. Assume $X$ is distributed as $X^{(k)}$ if $X$ is from population $\Pi_k$, $k = 0, 1$, that is, $X^{(k)}$ is the conditional distribution of $X$ given $Y = k$. Also let $\pi_k = P(Y = k)$ be the prior probability that an observation falls into $\Pi_k$. Our aim is to infer the group label $Y$ of a new observation $X$.

The optimal Bayes classification rule that minimizes misclassification error classifies an observation $X = x$ to $\Pi_1$ if

$$Q(x) = \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} > 1, \tag{1}$$

where we denote realized functional observations by $x$ and random predictor functions by $X$. If the conditional densities of the functional observations exist, where conditioning is on the respective group label, we denote them as $g_0$ and $g_1$ when conditioning on group 0 or 1. Then the Bayes theorem implies

$$Q(x) = \frac{\pi_1 g_1(x)}{\pi_0 g_0(x)}. \tag{2}$$

However, the densities for functional data do not usually exist (see Delaigle and Hall 2010). To overcome this difficulty, we consider a sequence of approximations to the functional observations, where the number of components is increasing, and then use the density ratios (2).

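Our approach is to first represent $x$ and the random $X$ by projecting onto an orthonormal basis $\{\psi_j\}_{j=1}^{\infty}$. This leads to the projection scores $\{x_j\}_{j=1}^{\infty}$ and $\{\xi_j\}_{j=1}^{\infty}$, where $x_j = \int_{\mathcal{T}} x(t)\psi_j(t)\,dt$ and $\xi_j = \int_{\mathcal{T}} X(t)\psi_j(t)\,dt$, $j = 1, 2, \dots$. As noted in Hall et al. (2001), when comparing the conditional probabilities, it is sensible to project the data from both groups onto the same basis. Our goal is to approximate the conditional probabilities $P(Y = k \mid X = x)$ by $P(Y = k \mid \xi_1 = x_1, \dots, \xi_J = x_J)$, where $J$ is the number of components used. Then by Bayes theorem,

$$Q(x) \approx \frac{P(Y = 1 \mid \xi_1 = x_1, \dots, \xi_J = x_J)}{P(Y = 0 \mid \xi_1 = x_1, \dots, \xi_J = x_J)} = \frac{\pi_1 f_1(x_1, \dots, x_J)}{\pi_0 f_0(x_1, \dots, x_J)}, \tag{3}$$

where $f_1$ and $f_0$ are the conditional densities for the first $J$ random projection scores $\xi_1, \dots, \xi_J$.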

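Estimating the joint density of $(\xi_1, \dots, \xi_J)$ is impractical and subject to the curse of dimensionality when $J$ is large, so it is sensible to introduce reasonable conditions that simplify (3). A first simplification is to assume the auto-covariances of the stochastic processes that generate the observed data have the same ordered eigenfunctions for both populations. Specifically, write $G_k(s, t) = \mathrm{cov}\{X^{(k)}(s), X^{(k)}(t)\}$, and define the covariance operators of $X^{(k)}$ as

$$G_k: L^2(\mathcal{T}) \to L^2(\mathcal{T}), \qquad G_k(f)(t) = \int_{\mathcal{T}} G_k(s, t) f(s)\,ds.$$

Assuming $G_k(s, t)$ is continuous, by Mercer’s theorem

$$G_k(s, t) = \sum_{j=1}^{\infty} \lambda_{jk}\,\phi_{jk}(s)\,\phi_{jk}(t), \tag{4}$$

where $\lambda_{1k} \ge \lambda_{2k} \ge \dots \ge 0$ are the eigenvalues of $G_k$ and $\phi_{jk}$ are the corresponding orthonormal eigenfunctions, $j = 1, 2, \dots$, and $\sum_{j=1}^{\infty} \lambda_{jk} < \infty$, for $k = 0, 1$. The common eigenfunction condition then is $\phi_{j0} = \phi_{j1} = \phi_j$, for $j = 1, 2, \dots$ (Flury 1984; Benko et al. 2009; Boente et al. 2010; Coffey et al. 2011). We note that this assumption can be weakened to require only that the two populations share the same set of eigenfunctions, not necessarily in the same order. In this case, we reorder the eigenfunctions and eigenvalues such that $\phi_{j0} = \phi_{j1} = \phi_j$ holds, but the $\lambda_{jk}$ are not necessarily in descending order for $k = 0, 1$.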

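Choosing the projection directions as the shared eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$, one has $\mathrm{cov}(\xi_j, \xi_l \mid Y = k) = 0$ if $j \ne l$. We note that in general the score $\xi_j = \int_{\mathcal{T}} X(t)\phi_j(t)\,dt$ is not the functional principal component (FPC) $\int_{\mathcal{T}} \{X(t) - \mu_k(t)\}\phi_j(t)\,dt$, where $\mu_k$ denotes the mean function of $\Pi_k$.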

A second simplification is that we assume that the projection scores are independent under both populations. Then the densities in (3) factor and the criterion function can be rewritten by taking logarithms as

$$Q_J(x) = \log\frac{\pi_1}{\pi_0} + \sum_{j \le J} \log\frac{f_{j1}(x_j)}{f_{j0}(x_j)}, \tag{5}$$

where $f_{jk}$ is the density of the $j$th score under $\Pi_k$. We classify $x$ into $\Pi_1$ if and only if $Q_J(x) > 0$. Due to the zero divided by zero problem, (5) is defined only on a set $\mathcal{X}$ with $P(X \in \mathcal{X}) = 1$. Our theoretical arguments in the following are restricted to this set. For the asymptotic analysis we will consider the case where $J = J(n) \to \infty$ as $n \to \infty$.
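As a minimal sketch of how a criterion of the form (5) is evaluated, the following Python snippet adds the log prior ratio to the log density ratios of the individual scores. The function name `log_ratio_criterion`, its arguments, and the two toy Gaussian score densities are illustrative assumptions and not part of the paper; a score at which both densities vanish is rejected, mirroring the zero divided by zero restriction noted above.

```python
import math

def log_ratio_criterion(scores, dens1, dens0, prior1=0.5):
    """Log prior ratio plus the sum of log density ratios over the first J scores.

    scores: projection scores of one curve (hypothetical input).
    dens1, dens0: one-dimensional density functions, one per score and group.
    """
    q = math.log(prior1 / (1.0 - prior1))
    for x, f1, f0 in zip(scores, dens1, dens0):
        a, b = f1(x), f0(x)
        if a == 0.0 and b == 0.0:
            raise ValueError("criterion undefined: both densities are zero")
        if a == 0.0:
            q += -math.inf          # score impossible under group 1
        elif b == 0.0:
            q += math.inf           # score impossible under group 0
        else:
            q += math.log(a) - math.log(b)
    return q

def normal_pdf(mean, var):
    """Density of a normal distribution, used here only as a toy score density."""
    return lambda x: math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy example: two scores, the groups differ in the mean of the first score only.
dens0 = [normal_pdf(0.0, 1.0), normal_pdf(0.0, 0.5)]
dens1 = [normal_pdf(1.0, 1.0), normal_pdf(0.0, 0.5)]
q = log_ratio_criterion([0.9, -0.2], dens1, dens0)
label = 1 if q > 0 else 0   # assign to group 1 when the criterion is positive
```

In this toy case the second score has the same density in both groups, so it contributes nothing to the criterion, and the decision rests on the first score alone.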

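When predictor processes $X^{(k)}$ are Gaussian for $k = 0, 1$, the projection scores are independent and one may substitute Gaussian densities for the densities in (5). Define the $j$th projection of the mean function $\mu_k$ of $\Pi_k$ as

$$\mu_{jk} = E(\xi_j \mid Y = k) = \int_{\mathcal{T}} \mu_k(t)\phi_j(t)\,dt.$$

Then in this special case of our more general nonparametric approach, one obtains the simplified version

$$Q_J^G(x) = \log\frac{\pi_1}{\pi_0} + \frac{1}{2}\sum_{j \le J}\left[(\log\lambda_{j0} - \log\lambda_{j1}) - \left\{\frac{(x_j - \mu_{j1})^2}{\lambda_{j1}} - \frac{(x_j - \mu_{j0})^2}{\lambda_{j0}}\right\}\right]. \tag{6}$$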

Here $Q_J^G(X)$ either converges to a random variable almost surely if $\sum_{j \ge 1} (\mu_{j1} - \mu_{j0})^2/\lambda_{j0} < \infty$ and $\sum_{j \ge 1} (\lambda_{j0}/\lambda_{j1} - 1)^2 < \infty$, or otherwise diverges to $-\infty$ or $\infty$ almost surely, as $J \to \infty$. More details about the properties of $Q_J^G(X)$ can be found in Lemma 2 in appendix A.3. It is apparent that (6) is the quadratic discriminant rule using the first $J$ projection scores, which is the Bayes rule for multivariate Gaussian data with different covariance structures. If further $\lambda_{j0} = \lambda_{j1}$ for all $j$, then one has equal covariances and (6) reduces to the functional linear discriminant (Delaigle and Hall 2012).
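For the Gaussian special case the log density ratios have a closed form. The sketch below evaluates the quadratic rule obtained by inserting normal densities with group-specific projected means and variances into the log density ratio of each score; the function name `gaussian_criterion` and its argument names are hypothetical.

```python
import math

def gaussian_criterion(x, mu0, mu1, lam0, lam1, prior1=0.5):
    """Quadratic discriminant on J independent Gaussian scores.

    x: scores of the curve to classify; mu0, mu1: projected group means;
    lam0, lam1: score variances in the two groups (all hypothetical inputs).
    """
    q = math.log(prior1 / (1.0 - prior1))
    for xj, m0, m1, l0, l1 in zip(x, mu0, mu1, lam0, lam1):
        q += 0.5 * ((math.log(l0) - math.log(l1))
                    - ((xj - m1) ** 2 / l1 - (xj - m0) ** 2 / l0))
    return q

# Equal variances: the quadratic terms cancel and the rule is linear in the score.
q_lin = gaussian_criterion([0.9], [0.0], [1.0], [1.0], [1.0])
# Equal means, different variances: only the quadratic part discriminates.
q_quad = gaussian_criterion([2.0], [0.0], [0.0], [1.0], [4.0])
```

The second call illustrates why a quadratic rule can separate groups that a linear rule cannot: with identical means, a score far from zero is evidence for the group with the larger variance.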

Because our method does not assume Gaussianity and allows for densities of general form in (5), we may expect better performance than Gaussian-based functional classifiers when the population is non-Gaussian. In practice the projection score densities are estimated nonparametrically by kernel density estimation (Silverman 1986) or in the alternative nonparametric regression version by kernel regression (Nadaraya 1964; Watson 1964), as described in section 3.

3 Estimation

We first estimate the common eigenfunctions by pooling data from both groups to obtain a joint covariance estimate. Since we assume that the eigenfunctions are the same, while eigenvalues and thus covariances may differ, we can write $G_k(s, t) = \sum_{j=1}^{\infty} \lambda_{jk}\phi_j(s)\phi_j(t)$, where the $\phi_j$ are the common eigenfunctions. We define the joint covariance operator $G = \pi_0 G_0 + \pi_1 G_1$. Then $\phi_j$ is also the $j$th eigenfunction of $G$ with eigenvalue $\lambda_j = \pi_0\lambda_{j0} + \pi_1\lambda_{j1}$.

Assume we have functional predictors $X_1^{(0)}, \dots, X_{n_0}^{(0)}$ and $X_1^{(1)}, \dots, X_{n_1}^{(1)}$ from $\Pi_0$ and $\Pi_1$, respectively. In practice, the assumption that functional data for which one wants to construct classifiers are fully observed is often unrealistic. Rather, one has available dense observations that have been taken on a regular or irregular design, possibly with some missing observations, where the measurements are contaminated with independent measurement errors that have zero mean and finite variance. In this case, we first smooth the discrete observations to obtain a smooth estimate for each trajectory, using local linear kernel smoothers, and then regard the smoothed trajectory as a fully observed functional predictor. In our theoretical analysis, we justify this approach and show that we obtain the same asymptotic classification results as if we had fully observed the true underlying random functions. Details about the pre-smoothing and the resulting classification are given right before Theorem 4 in section 4, which justifies the pre-smoothing approach by establishing asymptotic equivalence to the case of fully observed functions under suitable regularity conditions.

We estimate the mean and covariance functions by and , the sample mean and sample covariance function under group , respectively, and estimate by . Setting and denoting the th eigenvalue-eigenfunction pair of by , we obtain and represent the projections for a generic functional observation by , . We denote the th estimated projection score of the th observation in group by . The eigenvalues are estimated by , which is motivated by , and the pooled eigenvalues by . We estimate the th projection scores of by . We observe that , , , and will be consistently estimated, with details in appendix A.
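The pooling step can be sketched numerically as follows, assuming curves observed without noise on a common equally spaced grid so that integrals can be approximated by Riemann sums. The function name `common_eigen_scores`, the use of NumPy, the quadrature rule, and the toy data are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def common_eigen_scores(X0, X1, t, n_comp):
    """Pooled-covariance eigenfunctions and projection scores on a regular grid.

    X0, X1: arrays of shape (n_k, len(t)) holding the curves of the two groups.
    Returns eigenfunctions, scores per group, and group-specific score variances.
    """
    dt = t[1] - t[0]                          # quadrature weight for the integrals
    n0, n1 = len(X0), len(X1)
    p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)   # estimated prior probabilities
    G0 = np.cov(X0, rowvar=False)             # sample covariance surface, group 0
    G1 = np.cov(X1, rowvar=False)             # sample covariance surface, group 1
    G = p0 * G0 + p1 * G1                     # pooled covariance estimate
    vals, vecs = np.linalg.eigh(G * dt)       # discretised covariance operator
    order = np.argsort(vals)[::-1][:n_comp]   # leading components first
    phi = vecs[:, order].T / np.sqrt(dt)      # rows have unit L2 norm on the grid
    scores0 = X0 @ phi.T * dt                 # uncentred projections of each curve
    scores1 = X1 @ phi.T * dt
    lam0 = np.einsum("jt,ts,js->j", phi, G0, phi) * dt ** 2
    lam1 = np.einsum("jt,ts,js->j", phi, G1, phi) * dt ** 2
    return phi, scores0, scores1, lam0, lam1

# Toy data: two shared basis functions, group 1 has a larger first variance.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 51)
basis = np.vstack([np.ones_like(t), np.sqrt(2) * np.cos(2 * np.pi * t)])
X0 = (rng.normal(size=(40, 2)) * [1.0, 0.5]) @ basis
X1 = (rng.normal(size=(40, 2)) * [2.0, 0.5]) @ basis
phi, s0, s1, lam0, lam1 = common_eigen_scores(X0, X1, t, n_comp=2)
```

Because both groups are projected onto the same estimated directions, the resulting scores are directly comparable across groups, which is what the density ratio construction requires.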

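We then proceed to obtain nonparametric estimates of the densities for each of the projection scores. For this, we use kernel density estimates, applied to the sample projection scores $\hat\xi_{ijk}$, $i = 1, \dots, n_k$, from group $k$. The kernel density estimate (Silverman 1986) for the $j$th component in group $k$ is given by

$$\hat f_{jk}(u) = \frac{1}{n_k h_{jk}} \sum_{i=1}^{n_k} K\left(\frac{u - \hat\xi_{ijk}}{h_{jk}}\right), \tag{7}$$

where $K$ is the kernel, and $h_{jk} = h\sqrt{\hat\lambda_{jk}}$ is the bandwidth adapted to the variance of the projection score, with $\hat\lambda_{jk}$ the estimated variance of the $j$th score in group $k$. The bandwidth multiplier $h$ is the same for all projection density estimates and will be specified in section 4 and subsection 5.1. These estimates then lead to estimated density ratios $\hat f_{j1}(u)/\hat f_{j0}(u)$.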
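A kernel density estimate of the form (7) with a variance-adapted bandwidth can be sketched in a few lines. The Gaussian kernel follows the practical choice mentioned in subsection 5.1 rather than the compactly supported kernel of the theory, the sample standard deviation of the scores is used as a stand-in for the square root of the estimated eigenvalue, and the function name `kde` and the toy scores are assumptions made for illustration.

```python
import math

def kde(scores, h_mult):
    """Gaussian-kernel density estimate with bandwidth h_mult * sd(scores)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    h = h_mult * sd                      # bandwidth adapted to the score variance
    def f_hat(u):
        total = sum(math.exp(-0.5 * ((u - s) / h) ** 2) for s in scores)
        return total / (n * h * math.sqrt(2 * math.pi))
    return f_hat

# Toy scores of one component in the two groups (hypothetical values).
f0 = kde([-1.2, -0.4, 0.1, 0.3, 1.0], h_mult=0.5)
f1 = kde([0.8, 1.5, 2.1, 2.6, 3.3], h_mult=0.5)
ratio_at_2 = f1(2.0) / f0(2.0)           # estimated density ratio at u = 2
```

Using one multiplier for every component means that only a single tuning parameter has to be chosen, while each component still receives a bandwidth on its own scale.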

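An alternative approach for estimating the density ratios is via nonparametric regression. This is motivated by the Bayes theorem, as follows,

$$\frac{f_{j1}(u)}{f_{j0}(u)} = \frac{P(Y = 1 \mid \xi_j = u)\, f_j(u)/\pi_1}{P(Y = 0 \mid \xi_j = u)\, f_j(u)/\pi_0} = \frac{\pi_0\, p_j(u)}{\pi_1 \{1 - p_j(u)\}}, \tag{8}$$

where $p_j(u) = P(Y = 1 \mid \xi_j = u)$ and $f_j$ is the marginal density of the $j$th projection. This reduces the construction of nonparametric Bayes classifiers to a sequence of nonparametric regressions $E(Y \mid \xi_j = u) = p_j(u)$. These again can be implemented by a kernel method (Nadaraya 1964; Watson 1964), smoothing the scatter plots of the pooled estimated scores $\hat\xi_{ijk}$ of group $k = 0, 1$ against the group labels, which leads to the nonparametric estimators

$$\hat p_j(u) = \frac{\sum_{k=0}^{1}\sum_{i=1}^{n_k} k\, K\{(u - \hat\xi_{ijk})/h_j\}}{\sum_{k=0}^{1}\sum_{i=1}^{n_k} K\{(u - \hat\xi_{ijk})/h_j\}},$$

where $h_j$ is the bandwidth. This results in estimates $\hat p_j(u)$ that we plug in at the right hand side of (8), which then yields an alternative estimate of the density ratio, replacing the two kernel density estimates by just one nonparametric regression estimate $\hat p_j$.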
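The regression route can be sketched as follows: a Nadaraya-Watson smoother of the group labels against the pooled scores estimates the conditional probability of group 1, and Bayes' theorem converts this probability into a density ratio. The function names, the Gaussian kernel, the fixed bandwidth and the toy data are assumptions for illustration only.

```python
import math

def nw_prob(scores, labels, h):
    """Nadaraya-Watson estimate of P(label = 1 | score = u), Gaussian kernel."""
    def p_hat(u):
        w = [math.exp(-0.5 * ((u - s) / h) ** 2) for s in scores]
        return sum(wi * yi for wi, yi in zip(w, labels)) / sum(w)
    return p_hat

def ratio_from_regression(p, prior1):
    """Density ratio f1/f0 implied by p = P(label = 1 | score) via Bayes' theorem."""
    return (1.0 - prior1) * p / (prior1 * (1.0 - p))

# Pooled toy scores of one component with their group labels (hypothetical values).
scores = [-1.2, -0.4, 0.1, 0.3, 1.0, 0.8, 1.5, 2.1, 2.6, 3.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_hat = nw_prob(scores, labels, h=0.6)
ratio = ratio_from_regression(p_hat(2.0), prior1=0.5)
```

Far away from all observed scores the kernel weights can underflow to zero, so a practical implementation would need to guard the division in `p_hat`; the sketch omits this for brevity.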

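Writing $\hat x_j = \int_{\mathcal{T}} x(t)\hat\phi_j(t)\,dt$ for the projection of $x$ onto the $j$th estimated common eigenfunction $\hat\phi_j$, and $\hat\pi_k$ for the estimate of $\pi_k$, the estimated criterion function based on kernel density estimates is thus

$$\hat Q_J(x) = \log\frac{\hat\pi_1}{\hat\pi_0} + \sum_{j \le J} \log\frac{\hat f_{j1}(\hat x_j)}{\hat f_{j0}(\hat x_j)}, \tag{9}$$

while the estimated criterion function based on kernel regression is

$$\hat Q_J^R(x) = \log\frac{\hat\pi_1}{\hat\pi_0} + \sum_{j \le J} \log\frac{\hat\pi_0\, \hat p_j(\hat x_j)}{\hat\pi_1 \{1 - \hat p_j(\hat x_j)\}}. \tag{10}$$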

4 Theoretical Results

In this section we present the asymptotic equivalence of the estimated version of the Bayes classifiers to the true one. For the first three main results in Theorems 1-3 we assume fully observed functional predictors, and then in Theorem 4 we show that these results can be extended to the practically more relevant case where predictor functions are not fully observed, but are only indirectly observed through densely spaced noisy measurements. Following Delaigle and Hall (2012), we use the term “perfect classification” to mean that the misclassification rate approaches zero as more projection scores are used, and proceed to give conditions for the proposed nonparametric Bayes classifiers to achieve perfect classification. All proofs are in appendix A.

For theoretical considerations only we assume the following simplifications that can be easily bypassed. Without loss of generality we denote the mean functions of and as 0 and , respectively, since we can subtract the mean function of from all samples, whereupon becomes the difference in the mean functions, and stands for the th projection of the mean function. We also assume and . We use a common multiplier for all bandwidths in the kernel density estimates and in the kernel regression estimates, for all and .

We need the following assumptions:

  (A1) The covariance operators under $\Pi_0$ and $\Pi_1$ have common eigenfunctions;

  (A2) For all $j \ge 1$, the projection scores $\xi_j$ onto the common eigenfunctions $\phi_j$ are independent under $\Pi_0$ and $\Pi_1$, and their densities exist.

The common eigenfunction assumption (A1) means the covariance functions under and can be decomposed as

where the are the common eigenfunctions and are the associated eigenvalues. This means that the major modes of variation are assumed to be the same for both populations, while the variances in the common eigenfunction directions might change. In practice, the common eigenfunction assumption allows for meaningful comparisons of the modes of variation between groups, as it makes it possible to reduce such comparisons to comparisons of the functional principal components, as suggested by Coffey et al. (2011); for our analysis, the common eigenfunctions are convenient projection directions, and are assumed to be such that the projection scores become independent, as is for example the case if predictor processes satisfy the more restrictive Gaussian assumption. The common eigenfunction assumption is weaker than the shared covariance assumption as it allows for different eigenvalues between groups and thus for different covariance operators across groups.

Theorem 1 below states as in (9) is asymptotically equivalent to as in (5), for all . We define the kernel density estimator using the true projection scores as

Let be the density functions of the (standardized) FPCs when and that of when , be the kernel density estimates of using the estimated FPCs, and be the kernel density estimates using the true FPCs, analogous to and . Delaigle and Hall (2010) provide the uniform convergence rate of to on a compact domain, with detailed proof available in Delaigle and Hall (2011), and our derivation utilizes this result.

We make the following assumptions (B1)–(B5) for , in which (B1)–(B4) parallel assumptions (3.6)–(3.9) in Delaigle and Hall (2010), namely

  (B1) For all large and some , and ;

  (B2) For each integer , is bounded uniformly in ;

  (B3) The eigenvalues are all different, and so are the eigenvalues in each of the sequences , for ;

  (B4) The densities are bounded and have a bounded derivative; the kernel is a symmetric, compactly supported density function with two bounded derivatives; for some and is bounded away from zero as .

  (B5) The densities are bounded away from zero on any compact interval within their respective support, i.e. for all compact intervals , for and .

Note that (B1) is a Hölder continuity condition for the process , which is a slightly modified version of a condition in Hall and Hosseini-Nasab (2006) and Hall and Hosseini-Nasab (2009), and that (B2) is satisfied if the standardized FPCs have moments of all orders that are uniformly bounded. In particular, Gaussian processes satisfy (B2) since the standardized FPCs identically follow the standard normal distribution. Recall that the in (B3) are the eigenvalues of the pooled covariance operator, and (B3) is a standard condition (Bosq 2000). (B4) and (B5) are needed for constructing consistent estimates for the density quotients.

Theorem 1.

Assuming (A1), (A2), and (B1)–(B5), for any there exist a set with and a sequence such that as .

Theorem 1 provides the asymptotic equivalence of the estimated classifier based on the kernel density estimates (7) and the true Bayes classifier. This implies that it is sufficient to investigate the asymptotics of the true Bayes classifier to establish asymptotic perfect classification. The following theorem establishes an analogous result about the equivalence of the estimated classifier based on kernel regression and the true Bayes classifier.

Theorem 2.

Assuming (A1), (A2), and (B1)–(B5), for any there exist a set with and a sequence such that as .

Our next result shows that the proposed nonparametric Bayes classifiers achieve perfect classification under certain conditions. Intuitively, the following theorem describes when the individual pieces of evidence provided by each of the independent projection scores add up strong enough for perfect classification. Let and . We impose the following conditions on the standardized FPCs:

  (C1) The densities and are uniformly bounded for all .

  (C2) The first four moments of under and those of under are uniformly bounded for all .

Theorem 3.

Assuming (A1), (A2), and (C1)–(C2), the Bayes classifier achieves perfect classification if or , as .

Note that in general the conditions for perfect classification in Theorem 3 are not necessary. The general case that we study here has the following interesting feature. When and are non-Gaussian, perfect classification may occur even if the mean and covariance functions under the two groups are the same, because one has infinitely many projection scores to obtain information, each possibly having strong discrimination power due to the different shapes of distributions under different groups.

Consider the following example. Let the projection scores be independent random variables with mean 0 and variance that follow normal distributions under and Laplace distributions under . Then

(11)

Since centered normal and Laplace distributions form a scale family, we have that have a common standard distribution under , irrespective of . Denoting the summand of (11) by , this implies are independent and identically distributed. Note that , , and has finite variance under either population. So the misclassification error under is

where the inequality is due to Chebyshev’s inequality and the last equality is due to are identically and independently distributed. Similarly, the misclassification error under also goes to zero as . Therefore perfect classification occurs under this non-Gaussian case where both the mean and covariance functions are the same. This provides a case where attempts at classification under Gaussian assumptions are doomed, as mean and covariance functions are the same between the groups.
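The normal versus Laplace example can be checked by a small Monte Carlo experiment. Because both families are scale families, only standardized scores matter, so the sketch below draws standard normal scores, sums the log ratios of a unit-variance Laplace density to the standard normal density, and records how often the sum is positive, that is, how often a Gaussian curve would be assigned to the Laplace group. Equal prior probabilities, the function names, the seed and the chosen numbers of components are assumptions made for illustration.

```python
import math
import random

B = 1.0 / math.sqrt(2.0)   # Laplace scale parameter that gives unit variance

def log_ratio(z):
    """log of (unit-variance Laplace density / standard normal density) at z."""
    log_laplace = -math.log(2.0 * B) - abs(z) / B
    log_normal = -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    return log_laplace - log_normal

def error_rate_normal_group(J, reps=1000, seed=1):
    """Monte Carlo share of Gaussian curves assigned to the Laplace group."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(reps):
        q = sum(log_ratio(rng.gauss(0.0, 1.0)) for _ in range(J))
        wrong += q > 0           # positive criterion means a wrong assignment
    return wrong / reps

errors = {J: error_rate_normal_group(J) for J in (1, 25, 400)}
```

Each score carries only weak evidence, since the two standardized densities share mean and variance, but the evidence accumulates over independent components, so the estimated error rate decreases as more components are included.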

In practice we observe only discrete and noisy measurements

for the th functional predictors in group , for , where is the number of measurements per curve. We smooth the noisy measurements by local linear kernel smoothers and obtain , targeting the true predictor . More precisely, for each we let , where

is the kernel, and is the bandwidth for pre-smoothing. We let and be the sample mean and covariance functions of the smoothed predictors in group , and . Also, let be the th eigenfunction of for . Denote and as the projection score for a random or fixed function onto . Then we use kernel density estimates
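A local linear kernel smoother for a single curve can be sketched as a weighted least squares line fit around each target point. The triweight kernel is chosen here because it is compactly supported and twice continuously differentiable; this kernel, the function name `local_linear` and the toy inputs are assumptions for illustration and not necessarily the choices made by the authors.

```python
import math

def local_linear(t_obs, y_obs, t0, h):
    """Local linear kernel estimate at t0 from measurement pairs (t_obs, y_obs)."""
    s0 = s1 = s2 = r0 = r1 = 0.0
    for t, y in zip(t_obs, y_obs):
        u = (t - t0) / h
        w = (1.0 - u * u) ** 3 if abs(u) < 1.0 else 0.0   # triweight kernel weight
        d = t - t0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        r0 += w * y
        r1 += w * d * y
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few observations inside the smoothing window")
    return (s2 * r0 - s1 * r1) / det    # intercept of the weighted line fit

# Check on a noise-free straight line, which a local linear fit reproduces exactly.
grid = [i / 50 for i in range(51)]
line = [2.0 + 3.0 * t for t in grid]
fit = local_linear(grid, line, t0=0.37, h=0.1)
```

The normalizing constant of the kernel is omitted because it cancels in the ratio that defines the intercept.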

(12)

analogous to (7).

To obtain theoretical results under pre-smoothing, we make regularity assumptions (D1)–(D4), which parallel assumptions (B2)–(B4) in the supplementary material of Kong et al. (2016):

  (D1) For , is twice continuously differentiable on with probability 1, and

  (D2) For and , are considered deterministic and ordered increasingly. There exist design densities which are uniformly smooth over satisfying and that generate according to , where is the inverse of .

  (D3) For each , there exist a common sequence of bandwidth such that , where is the bandwidth for smoothing . The kernel function for local linear smoothing is twice continuously differentiable and compactly supported.

  (D4) Let and . , is of order , and , where is the common bandwidth multiplier in kernel density estimates.

Let be the classifier using components analogous to in (9), but with kernel density estimates constructed with the pre-smoothed predictors. Under the above assumptions, we obtain an extension of Theorem 1.

Theorem 4.

Assuming (A1), (A2), (B1)–(B5), and (D1)–(D4), for any there exist a set with and a sequence such that as .

5 Numerical Properties

5.1 Practical Considerations

We propose three practical implementations for estimating the projection score densities that will be compared in our data illustrations, along with other previously proposed functional classification methods. All of these involve the choice of tuning parameters (namely bandwidths and number of included components) and we describe in the following how these are specified.

Our first implementation is the nonparametric density classifier as in (9), where one estimates the density of each projection by applying kernel density estimators to the observed sample scores as in (7). For these kernel estimates we use a Gaussian kernel and the bandwidth multiplier is chosen by 10-fold cross-validation (CV), minimizing the misclassification rate.

The second implementation is nonparametric regression as in (10), where we apply kernel smoothing (Nadaraya 1964; Watson 1964) to the scatter plots of the pooled estimated scores and group labels. For each scatter plot, a Gaussian kernel is used and the bandwidth multiplier is also chosen by 10-fold CV.

Our third implementation, referred to as Gaussian method and included mainly for comparison purposes, assumes each of the projections to be normally distributed with mean and variance estimated by the sample mean and sample variance of , . We then use the density of as . This Gaussian implementation differs from the quadratic discriminant implementation discussed for example in Delaigle and Hall (2013), as in our approach we always force the projection directions for the two populations to be the same. This has the practical advantage of providing more stable estimates for the eigenfunctions and is a prerequisite for constructing nonparametric Bayes classifiers for functional predictors.

For numerical stability, if the densities are zero we insert a very small lower threshold (100 times the gap between 0 and the next double-precision floating-point number). Finally, the number of projections used in our implementations is chosen by 10-fold CV (together with the selection of for the nonparametric classifiers).
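The lower threshold described above can be written down directly. The sketch below obtains the smallest positive double from `sys.float_info` and floors both densities before taking logarithms; the names `TINY`, `FLOOR` and `safe_log_ratio`, and the decision to floor both the numerator and the denominator, are assumptions for illustration.

```python
import math
import sys

# Gap between 0 and the next double: the smallest positive subnormal, 2**-1074.
TINY = sys.float_info.min * sys.float_info.epsilon
FLOOR = 100 * TINY               # lower threshold inserted for zero densities

def safe_log_ratio(f1, f0):
    """Log density ratio with both density values floored away from zero."""
    return math.log(max(f1, FLOOR)) - math.log(max(f0, FLOOR))

both_zero = safe_log_ratio(0.0, 0.0)     # flooring makes this score uninformative
one_zero = safe_log_ratio(0.0, 0.2)      # large negative but finite contribution
```

With this flooring, a score at which both estimated densities vanish contributes zero to the criterion, and a score at which only one vanishes contributes a large but finite amount, so the sum over components remains well defined.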

5.2 Simulation Results

We illustrate our Bayes classifiers in several simulation settings. In each setting we generate training samples, each having chance to be from or . The samples are generated by , , where is the number of samples in , . Here the are independent random variables with mean 0 and variance , which are generated under two distribution scenarios: Scenario A, the are normally distributed, and Scenario B, the are centered exponentially distributed. In both scenarios, the are the th function in the Fourier basis, where , , , etc., . We set , and or for the same or the different mean scenarios, respectively. The variances of the scores under are , and those under are or for the same or the different variance scenarios, respectively, for . The random functions are sampled at 51 equally spaced time points from 0 to 1, with additional small measurement errors in the form of independent Gaussian noise with mean 0 and variance 0.01 to each observation for both scenarios. We use modest sample sizes of and for training the classifiers, and 500 samples for evaluating the predictive performance. We repeat each simulation experiment times.
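A data-generating mechanism of this type can be sketched as follows. The ordering of the Fourier basis (constant, then cosine and sine pairs), the number of components, the eigenvalue decay and the zero mean function used in the example call are hypothetical placeholders and should not be read as the settings of the study; only the grid of 51 points and the noise variance of 0.01 are taken from the description above.

```python
import math
import random

def fourier_basis(j, t):
    """j-th Fourier basis function on [0, 1], j = 1, 2, ... (assumed ordering)."""
    if j == 1:
        return 1.0
    k = j // 2
    trig = math.cos if j % 2 == 0 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * k * t)

def simulate_curve(variances, mean_fn, grid, rng, noise_sd=0.1, exponential=False):
    """One noisy curve: mean plus independent scores times basis functions."""
    scores = []
    for lam in variances:
        sd = math.sqrt(lam)
        if exponential:
            scores.append(sd * (rng.expovariate(1.0) - 1.0))  # centred, variance lam
        else:
            scores.append(rng.gauss(0.0, sd))
    return [mean_fn(t)
            + sum(a * fourier_basis(j + 1, t) for j, a in enumerate(scores))
            + rng.gauss(0.0, noise_sd)                         # measurement error
            for t in grid]

rng = random.Random(0)
grid = [i / 50 for i in range(51)]                 # 51 equally spaced points on [0, 1]
lam = [math.exp(-j / 3.0) for j in range(1, 11)]   # hypothetical eigenvalue decay
curve = simulate_curve(lam, lambda t: 0.0, grid, rng)
```

Setting `exponential=True` switches from normal scores to centered exponential scores with the same variances, which corresponds to the non-Gaussian scenario.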

We compare the predictive performance of the following functional classification methods: (1) the centroid method (Delaigle and Hall 2012); (2) the proposed nonparametric Bayes classifier in three versions, basing estimation on Gaussian densities (Gaussian), nonparametric densities (NPD), or nonparametric regression (NPR), which are the three implementations discussed above; (3) logistic regression; and (4) the functional quadratic discriminant as in Galeano et al. (2015). The functional quadratic discriminant was never the winner in any scenario of our simulation study, so we omit it from the tables.

The results for Scenario A are shown in Table 1, whose upper half corresponds to using the noisy predictors as is, and whose lower half corresponds to pre-smoothing the predictors by a local linear smoother with CV bandwidth choice. For these cases, the proposed nonparametric Bayes classifiers are seen to have superior performance for those scenarios where covariance differences in the populations are present, while the centroid and the logistic methods work best for those cases where the differences are exclusively in the mean.

n     Mean   Variance   Centroid      Gaussian      NPD           NPR           Logistic
without pre-smoothing:
50    same   diff       49.3 (0.12)   23.8 (0.18)   24.5 (0.21)   26.7 (0.22)   49.4 (0.12)
      diff   same       40.2 (0.16)   41.5 (0.16)   43.4 (0.17)   42.4 (0.18)   40.7 (0.16)
      diff   diff       37.9 (0.17)   20.8 (0.18)   21.2 (0.2)    23.3 (0.22)   38.8 (0.17)
100   same   diff       49.1 (0.13)   17.2 (0.11)   18.6 (0.12)   20 (0.13)     49.3 (0.13)
      diff   same       37.8 (0.13)   39.2 (0.13)   41.4 (0.15)   40.2 (0.16)   38.3 (0.13)
      diff   diff       35.3 (0.14)   14.6 (0.1)    15.8 (0.1)    17.1 (0.12)   35.8 (0.15)
with pre-smoothing:
50    same   diff       48.9 (0.14)   22.7 (0.17)   23.1 (0.2)    25.7 (0.21)   48.9 (0.13)
      diff   same       36.5 (0.24)   38.3 (0.22)   40.7 (0.22)   39.3 (0.23)   32.2 (0.26)
      diff   diff       33.4 (0.25)   18 (0.16)     18.4 (0.18)   20.3 (0.2)    28.1 (0.26)
100   same   diff       48.9 (0.14)   17.1 (0.11)   18.1 (0.12)   19.4 (0.13)   49.1 (0.14)
      diff   same       29.8 (0.23)   31.6 (0.23)   33.6 (0.25)   31.9 (0.25)   25.4 (0.15)
      diff   diff       27 (0.24)     13 (0.11)     14 (0.12)     14.8 (0.13)   21.1 (0.14)
Table 1: Misclassification rates (in percent) for Scenario A (Gaussian case), with standard deviation for the mean estimate in brackets. The Gaussian, NPD, and NPR methods correspond to the Gaussian, nonparametric density, and nonparametric regression implementations of the proposed Bayes classifiers, respectively. The upper half corresponds to using the functional predictors with noisy measurements as is, and the lower half corresponds to using pre-smoothed predictors.

In the cases where the covariances differ, the proposed Bayes classifiers have substantial performance advantages over other methods. This is because they take into account both mean and covariance differences between the populations. When the covariances are the same but the means differ, the centroid method is the overall best if we use the noisy predictors, while the Gaussian implementation of the proposed Bayes classifiers has comparable performance. This is expected because our method estimates more parameters than the centroid method while both assume the correct model for the simulated data. The quadratic method (not shown) does not perform well for these simulation data because it fails to take into account the common eigenfunction structure. The logistic method gains considerable performance from pre-smoothing, and becomes the winner when only a mean difference is present.

The simulation results for Scenario B are reported in Table 2. The performance of the proposed Bayes classifiers deteriorates somewhat in this scenario but they still perform substantially better than all other methods when covariance differences occur. When there are differences between the covariances, the Gaussian implementation performs the best when the sample size is small, while the nonparametric density implementation performs the best when the sample size is large.

| Pre-smoothing | n | Mean | Covariance | Centroid | Gaussian | NPD | NPR | Logistic |
|---|---|---|---|---|---|---|---|---|
| no | 50 | same | diff | 49 (0.13) | 30.2 (0.19) | 31.2 (0.22) | 33.5 (0.23) | 49.2 (0.13) |
| no | 50 | diff | same | 38.3 (0.21) | 40.6 (0.21) | 39.5 (0.22) | 38.6 (0.21) | 38.7 (0.23) |
| no | 50 | diff | diff | 35 (0.2) | 23.3 (0.18) | 23.5 (0.21) | 24.3 (0.22) | 35.7 (0.22) |
| no | 100 | same | diff | 48.8 (0.14) | 26 (0.13) | 25.4 (0.14) | 26.7 (0.16) | 48.9 (0.13) |
| no | 100 | diff | same | 35.8 (0.16) | 38.6 (0.19) | 36.3 (0.18) | 35.7 (0.16) | 35.9 (0.16) |
| no | 100 | diff | diff | 32.4 (0.14) | 18.7 (0.13) | 16.7 (0.13) | 17 (0.14) | 32.7 (0.15) |
| yes | 50 | same | diff | 48.5 (0.15) | 28.3 (0.18) | 29.1 (0.21) | 31.4 (0.24) | 48.6 (0.14) |
| yes | 50 | diff | same | 35 (0.24) | 38.4 (0.22) | 38 (0.22) | 36.5 (0.23) | 30.9 (0.23) |
| yes | 50 | diff | diff | 30.3 (0.24) | 20.2 (0.18) | 20.9 (0.22) | 21.4 (0.22) | 27 (0.23) |
| yes | 100 | same | diff | 48.5 (0.15) | 25.1 (0.13) | 24 (0.14) | 25 (0.14) | 48.4 (0.15) |
| yes | 100 | diff | same | 29.2 (0.23) | 33.3 (0.23) | 32.3 (0.2) | 31.1 (0.21) | 25.4 (0.17) |
| yes | 100 | diff | diff | 26.1 (0.22) | 16.5 (0.14) | 14.6 (0.13) | 14.7 (0.13) | 21.6 (0.16) |

Table 2: Misclassification rates (in percent) for Scenario B (exponential case), with standard deviation for the mean estimate in brackets. The upper half corresponds to using the functional predictors with noisy measurements as is, and the lower half corresponds to using pre-smoothed predictors.

5.3 Data Illustrations

We present three data examples to illustrate the performance of the proposed Bayes classifiers for functional data. We pre-smooth the yeast data with a local linear smoother with cross-validation (CV) bandwidth choice, since the original observations are quite noisy (shown in Figure 1), while for the wine and the ADHD datasets we use the original curves, which are preprocessed and smooth. Following the procedure described in Benko et al. (2009), we test the common eigenspace assumption for the first $J = 5$ and $J = 20$ eigenfunctions and report in Table 3 the p-values obtained from 2000 bootstrap samples. Only one test (wine_full with $J = 5$) rejects the null hypothesis that the first $J$ eigenspaces are shared by the two populations at the 0.1 significance level, which suggests that the common eigenfunction assumption is reasonable.

| | ADHD | yeast | yeast_pre | wine_full | wine_d1 |
|---|---|---|---|---|---|
| J = 5 | 0.31 | 0.57 | 0.15 | 0.098 | 0.31 |
| J = 20 | 0.80 | 0.63 | 0.76 | 0.72 | 0.55 |

Table 3: The p-values for testing the common eigenspace assumption, using $J = 5$ or $J = 20$ eigenfunctions. We report the results for both the original (yeast) and the pre-smoothed (yeast_pre) version of the yeast gene expression dataset.
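
The pre-smoothing step used for the yeast data can be sketched as follows. This is a minimal illustration for a single curve, assuming a Gaussian kernel and leave-one-out cross-validation over a user-supplied list of candidate bandwidths; the kernel, the CV variant, and all function names are assumptions made for the example, since the text only states that a local linear smoother with CV bandwidth choice is used.

```python
import math

def local_linear(t_obs, y_obs, t0, h):
    """Local linear estimate at t0 (Gaussian kernel, bandwidth h)."""
    s0 = s1 = s2 = r0 = r1 = 0.0
    for t, y in zip(t_obs, y_obs):
        d = t - t0
        w = math.exp(-0.5 * (d / h) ** 2)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        r0 += w * y
        r1 += w * d * y
    if s0 == 0.0:
        raise ValueError("bandwidth too small: all kernel weights are zero")
    denom = s0 * s2 - s1 * s1
    if denom <= 1e-10 * s0 * s2:
        return r0 / s0  # degenerate design: fall back to a local mean
    # Intercept of the weighted least squares line fitted around t0.
    return (s2 * r0 - s1 * r1) / denom

def cv_bandwidth(t_obs, y_obs, candidates):
    """Bandwidth minimizing the leave-one-out squared prediction error.

    t_obs and y_obs are plain Python lists of equal length.
    """
    best_h, best_err = None, float("inf")
    for h in candidates:
        err = 0.0
        for i in range(len(t_obs)):
            t_rest = t_obs[:i] + t_obs[i + 1:]
            y_rest = y_obs[:i] + y_obs[i + 1:]
            err += (y_obs[i] - local_linear(t_rest, y_rest, t_obs[i], h)) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```

A smoothed curve is then obtained by evaluating `local_linear` on the observation grid with the selected bandwidth.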

We use repeated 10-fold CV misclassification error rates to evaluate the performance of the classifiers. To obtain a valid CV misclassification error rate, the selection of the number of components and of the bandwidth is carried out on the training data only in each CV partition. We repeat the process 500 times and report the mean misclassification rates and the standard deviations of the mean estimates. The misclassification results for the different datasets are shown in Table 4.
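
The evaluation scheme can be sketched as below. This is a hedged illustration rather than the code used for the reported results: `fit_with_selection` is a hypothetical user-supplied function that chooses all tuning parameters (for example the number of components and the bandwidth) from the training data alone and returns a prediction function, and the reported spread is assumed to be the standard error of the mean over repeats.

```python
import random

def repeated_cv_error(X, y, fit_with_selection, n_folds=10, n_repeats=500, seed=0):
    """Mean misclassification rate over repeated K-fold CV and its standard error.

    fit_with_selection(X_train, y_train) must tune on the training data only
    and return a function predict(x) -> label (hypothetical interface).
    """
    rng = random.Random(seed)
    n = len(y)
    rates = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # new random partition in every repeat
        folds = [idx[k::n_folds] for k in range(n_folds)]
        wrong = 0
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            predict = fit_with_selection([X[i] for i in train_idx],
                                         [y[i] for i in train_idx])
            wrong += sum(predict(X[i]) != y[i] for i in test_idx)
        rates.append(wrong / n)
    mean = sum(rates) / len(rates)
    var = sum((r - mean) ** 2 for r in rates) / max(len(rates) - 1, 1)
    return mean, (var / len(rates)) ** 0.5
```

Because tuning happens inside `fit_with_selection`, the held-out fold never influences the choice of tuning parameters, which is the point of restricting selection to the training data.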

As can be seen from Table 4, which contains all results for misclassification rates across the compared methods and data sets, the proposed nonparametric Bayes classifiers and the functional quadratic discriminant perform well overall, indicating that covariance operator differences contain crucial information for classification. Among the various implementations of the proposed Bayes classifiers, the Gaussian version performs best for the ADHD and yeast data, while the nonparametric versions perform best for the wine data. Pre-smoothing the predictors slightly improves the misclassification rate for the yeast dataset. We now provide more details about the various data.

| Data | Centroid | Gaussian | NPD | NPR | Logistic | Quadratic |
|---|---|---|---|---|---|---|
| ADHD | 41.7 (0.2) | 34.1 (0.1) | 36.7 (0.2) | 36.8 (0.2) | 47 (0.2) | 34.6 (0.2) |
| yeast | 20.0 (0.08) | 12.5 (0.09) | 15 (0.1) | 14.4 (0.1) | 20.8 (0.1) | 14.5 (0.09) |
| yeast_pre | 20.7 (0.1) | 12.3 (0.06) | 14.3 (0.09) | 14.1 (0.1) | 17.2 (0.09) | 14.4 (0.07) |
| wine_full | 6.84 (0.07) | 5.08 (0.06) | 5.09 (0.06) | 4.67 (0.06) | 7.56 (0.08) | 5.93 (0.08) |
| wine_d1 | 7.15 (0.07) | 6.99 (0.06) | 5.75 (0.06) | 5.37 (0.06) | 6.64 (0.07) | 5.69 (0.07) |

Table 4: CV misclassification rates (in percent) for the three example data. ADHD refers to the attention deficit hyperactivity disorder data. The yeast data refers to cell cycle gene expression time course data, and yeast_pre refers to the pre-smoothed version of the yeast data. The wine datasets concern the classification of the original spectra (wine_full) and the first derivative (wine_d1), which is constructed by forming difference quotients.

The first data example concerns classifying attention deficit hyperactivity disorder (ADHD) from brain imaging data. The data were obtained in the ADHD-200 Sample Initiative Project. ADHD is the most commonly diagnosed behavioral disorder in childhood, and can continue through adolescence and adulthood. The symptoms include lack of attention, hyperactivity, and impulsive behavior. We base our analysis on filtered preprocessed resting state data from the New York University (NYU) Child Study Center, parcellated with the anatomical automatic labelling atlas (Tzourio-Mazoyer et al. 2002), which contains 116 regions of interest (ROIs) that have been fractionated into functional resolution of the original image using nearest-neighbor interpolation to create a discrete labelling value for each pixel of the image. The mean blood-oxygen-level dependent (BOLD) signal in each ROI is recorded at 172 equally spaced time points. We use only subjects for which the ADHD index is in the lower quartile (defining ) or upper quartile (defining ), with and , respectively, and regard the group membership as the binary response to be predicted. The functional predictor is taken to be the average of the mean BOLD signals of the 91st to 108th regions, shown in Figure 1, corresponding to the cerebellum, which has been found to have a significant impact on the ADHD index in previous studies (Berquin et al. 1998).

Figure 1: The original functional predictors for the yeast (left panel) and Attention Deficit Hyperactivity Disorder (ADHD, right panel) data. is shown in dashed lines and in solid lines.

Our second data example focuses on yeast gene expression time courses during the cell cycle as predictors, which are described in Spellman et al. (1998). The predictors are gene expression level time courses for genes, observed at 18 equally spaced time points from 0 to 119 minutes. The expression trajectories for genes related to G1 phase regulation of the cell cycle are regarded as () and all others are regarded as (). The Gaussian implementation of the proposed Bayes classifiers outperforms the other methods by a margin of about 2 percentage points, while the functional quadratic discriminant is also competitive for this classification problem. Pre-smoothing improves the performance of all classifiers except the centroid method.

In the third example we analyze wine spectra data. These data have been made available by Professor Marc Meurens, Université Catholique de Louvain, at http://mlg.info.ucl.ac.be/index.php?page=DataBases. The dataset contains a training set of 93 samples and a testing set of 30 samples. We combine the training set and the test set into a single dataset of 123 samples. For each sample the mean infrared spectrum on 256 points and the alcohol content are observed. consists of the samples with alcohol contents greater than 12 () and () of the rest. We consider both the original observations (wine_full) and the first derivative (wine_d1), which is constructed by forming difference quotients. As for the other examples, the misclassification errors for the various methods are listed in Table 4.
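
The construction of a derivative curve by difference quotients can be sketched as follows. The text does not say which difference scheme is used, so forward differences on the observation grid are assumed here, and the function name is illustrative.

```python
def difference_quotients(values, grid):
    """Approximate the first derivative of one sampled curve.

    values: curve values at the grid points; grid: increasing grid points.
    Uses forward differences (an assumption), so the result has one
    entry fewer than the input.
    """
    return [(values[j + 1] - values[j]) / (grid[j + 1] - grid[j])
            for j in range(len(values) - 1)]
```

Applied to a spectrum observed on 256 points, this yields a derivative curve on 255 points.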

The original functional predictors for the wine example and the mean functions for each group are displayed in the left and the right panel of Figure 2, respectively. There are clear mean differences between the two groups, especially around the peaks, for example at . We show in Figure 3 the kernel density estimates of the first four projection scores, with in dashed lines and in solid lines. The densities are clearly not normal, and some of them (the first and second projections) appear to be bimodal. The differences between each pair of densities are not limited to location and scale, but also manifest themselves in the shapes of the densities; in the second and fourth plots the density estimate from one group is close to bimodal while the other is not. The nonparametric implementations of the proposed Bayes classifiers, based on nonparametric regression or nonparametric density estimation, are capable of reflecting such shape differences and therefore outperform the classifiers based on Gaussian assumptions.

Figure 2: The Wine Spectra. The left panel shows the original trajectories and the right panel shows the mean curves for each group. Trajectories of are displayed in dashed lines and those of in solid lines.
Figure 3: Kernel density estimates for the first four projection scores for the wine spectra. is shown in dashed lines and in solid lines.
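
How a density-based classifier can pick up shape differences such as bimodality may be seen in the following minimal sketch. It sums one-dimensional log ratios of Gaussian kernel density estimates over projection scores; the kernel, the fixed bandwidths, the small floor `eps` guarding against zero densities, and all names are assumptions made for illustration and do not reproduce the implementation behind the reported results.

```python
import math

def kde(sample, x, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    c = 1.0 / (len(sample) * h * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

def log_density_ratio(scores, train0, train1, h, prior1=0.5, eps=1e-300):
    """Log prior odds plus the sum of per-component log density ratios.

    scores: projection scores of the new curve.
    train0[j], train1[j]: training scores of component j in group 0 / 1.
    h[j]: bandwidth for component j (assumed given, e.g. chosen by CV).
    Positive values favor group 1.
    """
    total = math.log(prior1 / (1.0 - prior1))
    for j, x in enumerate(scores):
        f1 = kde(train1[j], x, h[j])
        f0 = kde(train0[j], x, h[j])
        total += math.log(max(f1, eps)) - math.log(max(f0, eps))
    return total
```

If one group has a bimodal score density with modes near -2 and 2 and the other is unimodal around 0, both groups have the same mean, yet a score of 2 yields a positive log ratio and a score of 0 a negative one.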

In all examples, the quadratic discriminant performs better than the centroid method, suggesting that in these examples there is information contained in the differences between the covariance operators of the two groups to be classified. In the presence of such subtler differences, and of additional shape differences in the distributions of the projection scores, the proposed nonparametric Bayes methods work particularly well for functional classification.

Appendix A Technical Arguments

For simplicity of presentation we adopt throughout all proofs the simplifying assumptions mentioned in Section 4. We remark that , , , and constructed from the sample mean, covariance, eigenfunctions and eigenvalues of the completely observed functions are consistent estimates for their corresponding targets, as per Hall and Hosseini-Nasab (2006).

A.1 Proof of Theorem 1

Let be a bounded set of all square integrable functions for , where is the norm. We will use the following lemma:

Lemma 1.

Assuming (B1)–(B4), for any , ,

(13)
Proof.

We prove the statement for ; the proof for is analogous. Observe

(14)

where the first rate is due to Delaigle and Hall (2010), and the second to, for example, Stone (1983) or Liebscher (1996). Then