Supervised classification for a family of Gaussian functional models

Amparo Baíllo¹, Juan Antonio Cuesta-Albertos² and Antonio Cuevas¹
Universidad Autónoma de Madrid and Universidad de Cantabria
E-mail addresses: amparo.baillo@uam.es, cuestaj@unican.es, antonio.cuevas@uam.es
¹ These authors have been partially supported by Spanish grant MTM2007-66632.
² This author has been partially supported by Spanish grant MTM2008-0607-C02-02.
Abstract

In the framework of supervised classification (discrimination) for functional data, it is shown that the optimal classification rule can be explicitly obtained for a class of Gaussian processes with “triangular” covariance functions. This explicit knowledge has two practical consequences. First, the consistency of the well-known $k$-nearest neighbors classifier (which is not guaranteed in problems with functional data) is established for the indicated class of processes. Second, and more importantly, parametric and nonparametric plug-in classifiers can be obtained by estimating the unknown elements in the optimal rule.

The performance of these new plug-in classifiers is checked, with positive results, through a simulation study and a real data example.

1 Introduction

Statement of the problem. Notation

Discrimination, also called “supervised classification” in modern terminology, is one of the oldest statistical problems in experimental science: the aim is to decide whether a random observation $X$ (taking values in a “feature space” $\mathcal{F}$ endowed with a distance $\mathcal{D}$) belongs either to the population $P_0$ or to $P_1$. For example, in a medical problem $P_0$ and $P_1$ could correspond to the groups of “healthy” and “ill” individuals, respectively. The decision must be taken from the information provided by a “training sample” $(X_1,Y_1),\ldots,(X_n,Y_n)$. Here the $X_i$, $i=1,\ldots,n$, are independent replications of $X$, measured on $n$ randomly chosen individuals, and the $Y_i$ are the corresponding values of an indicator variable $Y$ which takes the value 0 or 1 according to the membership of the $i$-th individual to $P_0$ or $P_1$. The term “supervised” refers to the fact that the individuals in the training sample are supposed to be correctly classified, typically using “external” non-statistical procedures, so that they provide a reliable basis for the assignment of the new observation $X$. It is possible to consider the case where $k$ populations are involved but, in what follows, we will restrict ourselves to the binary case $k=2$.

The mathematical problem is to find a “classifier” (or “classification rule”) $g$, with $g:\mathcal{F}\to\{0,1\}$, that minimizes the classification error $L(g)=\mathbb{P}(g(X)\neq Y)$. It is not difficult to prove (e.g., Devroye et al., 1996, p. 11) that the optimal classification rule (often called “Bayes rule”) is

$g^*(x) = \mathbb{1}_{\{\eta(x) > 1/2\}},$   (1)

where $\eta(x) = \mathbb{E}(Y\,|\,X=x) = \mathbb{P}(Y=1\,|\,X=x)$ and $\mathbb{1}_A$ stands for the indicator function of a set $A$. Of course, since $\eta$ is unknown, the exact expression of this rule is usually unknown, and thus different procedures have been proposed to approximate $g^*$ using the training data.

From now on we will use the following notation. Let $P_i$ be the distribution of $X$ conditional on $Y=i$, that is, $P_i(B)=\mathbb{P}(X\in B\,|\,Y=i)$ for $B\in\mathcal{B}$ (the Borel $\sigma$-algebra on $\mathcal{F}$) and $i=0,1$. We denote by $S_i$ the support of $P_i$, for $i=0,1$, and $p=\mathbb{P}(Y=1)$ (we assume $0<p<1$). Given two measures $\mu$ and $\nu$, the expression $\nu\ll\mu$ denotes that $\nu$ is absolutely continuous with respect to $\mu$ (i.e., $\mu(B)=0$ implies $\nu(B)=0$).

The notation $\mathcal{C}[0,1]$ stands for the space of real continuous functions on the interval $[0,1]$ endowed with the usual supremum norm, denoted by $\|\cdot\|_\infty$. The subspace of functions of class 2 (i.e., with two continuous derivatives) is denoted by $\mathcal{C}^2[0,1]$.

Finite dimensional spaces. Three classical discrimination procedures

The origin of the discrimination problem goes back to the classical work by Fisher (1936) where, in the $d$-variate framework $\mathcal{F}=\mathbb{R}^d$, a simple “linear classifier” of type $g(x)=\mathbb{1}_{\{a'x+b>0\}}$ was introduced for the case where both populations $P_0$ and $P_1$ are homoscedastic, that is, have a common covariance matrix $\Sigma$. Intuitively, the separating boundary is chosen as the affine hyperplane which provides the “maximum separation” between both populations. It is well known (see, e.g., Duda et al. 2000 for details) that the expression of Fisher’s rule turns out to depend on the inverse of the covariance matrix. It is also known that Fisher’s linear rule is in fact the optimal one (1) when the conditional distributions of $X|Y=0$ and $X|Y=1$ are homoscedastic normals and all the means and covariances are known. These conditions look quite restrictive but, as argued by Hand (2006) in a provocative paper, Fisher’s rule (or rather its sampling approximation obtained by estimating the unknown parameters) is hard to beat in practical examples. That is, while it is not difficult to construct examples where this rule fails outrageously, its performance is quite good in most cases found in real-life examples. For this reason, Fisher’s linear rule is still the most popular classification tool among practitioners, in spite of the subsequent intensive research on this topic. Thus, in a way, Fisher’s rule represents a sort of “gold standard” in the multivariate statistical discrimination problem.
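As a simple illustration (not taken from the paper; all names below are ours), the sample version of Fisher’s rule in $\mathbb{R}^d$ is obtained by plugging the empirical means and the pooled covariance matrix into the linear discriminant:

```python
import numpy as np

def fisher_rule(X0, X1, prior1=0.5):
    """Sample version of Fisher's linear discriminant for two homoscedastic
    d-variate populations; rows of X0 and X1 are training observations."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # pooled covariance estimate (common-covariance assumption)
    S = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(S, mu1 - mu0)        # direction Sigma^{-1} (mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(prior1 / (1 - prior1))
    return lambda x: int(w @ x + b > 0)      # assign to population 1 if positive
```

Note that the rule requires solving a linear system with the estimated covariance matrix; it is precisely this inversion step that becomes problematic in high and infinite dimension, as discussed below.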

The books by Devroye et al. (1996), Duda et al. (2000) and Hastie et al. (2001) offer different interesting perspectives of the work done in discrimination theory since Fisher’s pioneering paper. All of them focus on the standard multivariate case $\mathcal{F}=\mathbb{R}^d$. Many classifiers have been proposed as an alternative to Fisher’s linear rule in this finite-dimensional setup. One of the simplest and easiest to motivate is the so-called $k$-nearest neighbors ($k$-NN) method. Given a positive integer value (or smoothing parameter) $k$, this rule simply classifies an incoming observation $x$ in the population $P_1$ if the majority among the $k$ training observations closest to $x$ (with respect to the considered distance $\mathcal{D}$) belong to $P_1$. More concretely, the $k$-NN rule can be defined by

$g_n(x) = \mathbb{1}_{\{\hat\eta_n(x) > 1/2\}},$   (2)

where

$\hat\eta_n(x) = \frac{1}{k}\sum_{i=1}^{n} Y_i\,\mathbb{1}_{\{X_i \in \mathrm{knn}(x)\}},$   (3)

and “$X_i \in \mathrm{knn}(x)$” means that $X_i$ is one of the $k$ nearest neighbors of $x$.

In fact, the definition of the $k$-NN rule is extremely simple and can be introduced (in terms of “majority vote among the neighbors”) with no explicit reference to any regression estimator. However, the idea of replacing the unknown regression function $\eta$ in the optimal classifier (1) with a regression estimator (given by (3) in the case of the $k$-NN rule) is very natural. It suggests a general methodology to construct a wide class of classifiers by just plugging different regression estimators into (1) instead of the true regression function $\eta$. In the finite-dimensional case this is a particularly fruitful idea, as a wealth of different (parametric and nonparametric) estimators of $\eta$ is available; see Audibert and Tsybakov (2007) for some reasons in favor of the plug-in methodology in classification. The main purpose of this work is to show that the plug-in methodology can also be successfully used for classification in some functional data models.
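To fix ideas, here is a minimal sketch (our own illustration, with hypothetical names) of the $k$-NN plug-in rule for functional data, assuming the curves are observed on a common grid and the distance $\mathcal{D}$ is the supremum distance:

```python
import numpy as np

def knn_functional_classify(X_train, y_train, x_new, k=5):
    """k-NN rule for curves discretized on a common grid.
    X_train: (n, m) array, each row a trajectory; y_train: array of 0/1 labels."""
    # supremum (L-infinity) distance between x_new and each training curve
    dists = np.max(np.abs(X_train - x_new), axis=1)
    neighbors = np.argsort(dists)[:k]        # indices of the k closest training curves
    eta_hat = y_train[neighbors].mean()      # k-NN estimate of eta(x) as in (3)
    return int(eta_hat > 0.5)                # plug-in version of the Bayes rule (1)
```

Any other estimator of $\eta$ (or, as in Section 2, of the Radon-Nikodym derivative appearing in the optimal rule) can be plugged in at the last two lines in exactly the same way.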

Discrimination of functional data. Differences with the finite-dimensional case

We are concerned here with the problem of (binary) supervised classification with functional data. That is, we assume throughout that the space $\mathcal{F}$ where the data live is a separable metric space (typically a space of functions). For some theoretical results, considered below, we will impose more specific assumptions on $\mathcal{F}$.

The study of discrimination techniques with functional data is not as developed as the corresponding finite-dimensional theory but, clearly, is one of the most active research topics in the booming field of functional data analysis (FDA). Two well-known books including broad overviews of FDA with interesting examples are Ferraty and Vieu (2006) and Ramsay and Silverman (2005). A recent survey on supervised and unsupervised classification with functional data can be found in Baíllo et al. (2009).

While the formal statement of the functional classification problem is very much the same as that indicated at the beginning of this section, there are some important differences with the classical finite-dimensional case.

  (a) Lack of a simple functional version of Fisher’s linear rule: As mentioned above, the idea behind Fisher’s rule requires inverting the covariance operator. When $\mathcal{F}=\mathbb{R}^d$ this is increasingly difficult as the dimension $d$ increases, and it becomes impossible in the functional framework, where the covariance operator is typically not invertible. Thus the applicability of Fisher’s linear methodology to functional data is a non-trivial issue of current interest for research. See, for instance, James and Hastie (2001) and Shin (2008) for interesting adaptations of linear discrimination ideas to a functional setting.

  (b) Difficulty in implementing the plug-in idea: Unlike the finite-dimensional case, the plug-in methodology is not generally considered a standard procedure to construct functional classifiers. When $\mathcal{F}$ is infinite-dimensional, there are as yet few simple parametric models giving a good fit to the regression function $\eta$, and the structure of nonparametric estimators of $\eta$ is relatively complicated.

  (c) The $k$-NN functional classifier is not universally consistent: In the discrimination problem a sequence of classifiers $g_n$, based on samples of size $n$, is said to be “consistent” when the corresponding sequence of classification errors $L(g_n)$ converges, as $n$ tends to infinity, to the “lowest possible error” $L^*=L(g^*)$ attained by the Bayes classifier (1); see Section 3 below for more details. It turns out (see Stone, 1977) that, in the case of finite-dimensional data ($\mathcal{F}=\mathbb{R}^d$), any sequence of $k$-NN classifiers is consistent provided that $k=k_n\to\infty$ and $k_n/n\to 0$. Since such consistency holds irrespective of the distribution of the data $(X,Y)$, this property is called “universal consistency”.

    The definition of the $k$-NN classifier can be easily translated to the functional setup (by replacing the usual Euclidean distance in $\mathbb{R}^d$ with an appropriate functional metric $\mathcal{D}$). However, universal consistency is lost. Cérou and Guyader (2006, Th. 2) have obtained sufficient conditions for consistency of the $k$-NN classifier when $X$ takes values in a separable metric space. Nevertheless, the required assumptions are not always trivial to check. As the $k$-NN rule is a natural “default choice” in infinite-dimensional setups, an important issue is to ensure its consistency, at least for some functional models of practical interest.

The purpose and structure of this paper

This work aims to partially fill the gaps pointed out in points (b) and (c) of the above paragraph. To this end, in Subsection 2.1 a simple expression is obtained for the Bayes (optimal) rule in the case that both distributions, $P_0$ and $P_1$, are equivalent (mutually absolutely continuous). However, this rule turns out to depend on the Radon-Nikodym derivative $dP_1/dP_0$, which is usually unknown, or has an extremely involved expression, even when $P_0$ and $P_1$ are completely known. An interesting exception is given by Gaussian processes with a specific type of covariance functions, called “triangular”. For these processes the Radon-Nikodym derivative has been explicitly calculated by Varberg (1961) and Jørsboe (1968), whose results are collected and briefly commented on in Subsection 2.2. In Subsection 2.3 parametric plug-in estimators of the optimal rule are obtained by assuming that the underlying processes are either (parametric) Brownian motions or Ornstein-Uhlenbeck processes. Nonparametric plug-in estimators of the optimal rule are proposed and analyzed in Subsection 2.4, under the sole assumption that the covariance functions are triangular. Since the proofs of the results in this subsection are rather technical, they are deferred to a final appendix. This concludes our contributions regarding issue (b). Section 3 is devoted to the $k$-NN consistency problem introduced in (c): we use the above-mentioned result by Cérou and Guyader (2006) to show that the $k$-NN rule is consistent in functional classification problems where the data are generated by certain Gaussian triangular processes specified in Subsection 2.2.

Finally, in Section 4 the practical performance of the plug-in rules proposed in Section 2 is checked, and compared with the $k$-NN rule, through a simulation study and the analysis of a real data example.

2 The optimal classifier for a Gaussian family

2.1 A general expression based on Radon-Nikodym derivatives

When the distributions $P_0$ and $P_1$ of $X|Y=0$ and $X|Y=1$ are both absolutely continuous with respect to some common $\sigma$-finite measure $\mu$, it is easy to see, as a consequence of Bayes’ formula, that the optimal rule is

$g^*(x) = \mathbb{1}_{\{p f_1(x) > (1-p) f_0(x)\}},$   (4)

where $f_0$ and $f_1$ are the $\mu$-densities of $P_0$ and $P_1$, respectively.

The expression (4) is particularly important in finite-dimensional problems with $\mathcal{F}=\mathbb{R}^d$, where the Lebesgue measure arises as the natural reference measure and the corresponding Lebesgue densities can be estimated in many ways. In infinite-dimensional spaces there is no such obvious dominant measure. However, if we assume that $P_0$ and $P_1$, with supports $S_0$ and $S_1$, are absolutely continuous with respect to each other on $S_0\cap S_1$, the optimal rule can also be expressed in a simple way in terms of the Radon-Nikodym derivative $dP_1/dP_0$, as shown in the following result.

Theorem 1

Assume that $P_0\ll P_1$ and $P_1\ll P_0$ on $S_0\cap S_1$. Then

$g^*(x) = \mathbb{1}_{\left\{\frac{dP_1}{dP_0}(x) > \frac{1-p}{p}\right\}}$   (5)

provides the expression for the optimal rule $g^*$.

Proof: Define $\mu = P_0 + P_1$. Then $P_i\ll\mu$, for $i=0,1$, and we can define the Radon-Nikodym derivatives $f_i = dP_i/d\mu$, for $i=0,1$. From the definition of the conditional expectation we know that $\eta(x)=\mathbb{P}(Y=1\,|\,X=x)$ can be expressed by

$\eta(x) = \frac{p\,f_1(x)}{p\,f_1(x) + (1-p)\,f_0(x)}.$   (6)

Observe that $P_i(S_i)=1$ and thus $f_i=0$ $\mu$-a.e. outside $S_i$, for $i=0,1$. Since $P_0\ll P_1$ and $P_1\ll P_0$ on $S_0\cap S_1$ then, on this set, there exist the Radon-Nikodym derivatives $dP_1/dP_0$ and $dP_0/dP_1$. In this case, it also holds that $f_i>0$ $\mu$-a.e. on $S_0\cap S_1$, for both $i=0$ and $i=1$.

Then (see, e.g., Folland 1999), for $i=0,1$ and for $\mu$-a.e. $x\in S_0\cap S_1$,

$\frac{dP_i}{dP_{1-i}}(x) = \frac{f_i(x)}{f_{1-i}(x)}.$   (7)

Substituting (7) into expression (6) we get (5).

The mutual absolute continuity is not a very restrictive assumption if we deal with Gaussian measures. According to a well-known result by Feldman and Hájek (see Feldman, 1958), for any given pair of Gaussian processes there is a dichotomy: they are either equivalent (mutually absolutely continuous) or mutually singular. In the first case both measures $P_0$ and $P_1$ have a common support $S$. As for the identification of the support, Vakhania (1975) has proved that if a Gaussian process, with trajectories in a separable Banach space $\mathcal{B}$, is not degenerate (i.e., the distribution of any non-trivial linear continuous functional is not degenerate) then the support of such a process is the whole space $\mathcal{B}$.

In any case, expression (5) would be of no practical use unless some expressions, reasonably easy to estimate, can be found for the Radon-Nikodym derivative $dP_1/dP_0$. This issue is considered in the next subsection.

2.2 Explicit expression for a family of Gaussian distributions

The best known Gaussian process is perhaps the standard Brownian motion $W=\{W(t),\ t\in[0,1]\}$, for which $\mathbb{E}(W(t))=0$ and the covariance function is $K(s,t)=\min(s,t)$. A wide class of Brownian-type processes can be obtained by location and scale changes of type $X(t)=m(t)+\sigma W(t)$, where $m$ is a given mean function and $\sigma>0$.

In fact, the covariance structure can be generalized to define a much broader class of processes with $K(s,t)=u(s\wedge t)\,v(s\vee t)$, where $u$ and $v$ denote suitable real functions. Covariance functions of this type are called triangular. They have received considerable attention in the literature. For example, Sacks and Ylvisaker (1966) use this condition in the study of optimal designs for regression problems where the errors are generated by a zero-mean process with covariance function $K$. It turns out that the Hilbert space with reproducing kernel $K$ plays an important role in the results and, as these authors point out, the norm of this space is particularly easy to handle when $K$ is triangular. On the other hand, Varberg (1964) has given an interesting representation of the processes with zero mean and triangular covariance function. This author proved that they can be expressed as a transformation of the standard Wiener process $W$ by means of a function of bounded variation defined in terms of $K$.

The so-called Ornstein-Uhlenbeck model, for which $K(s,t)=c\,e^{-\theta|s-t|}$ (with constants $c,\theta>0$), provides another important class of processes with triangular covariance functions. They are widely used in physics and finance.
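For intuition, the following sketch (ours, not from the paper) simulates discretized trajectories of a Brownian motion and of a stationary Ornstein-Uhlenbeck process and compares their empirical covariances with the triangular form $u(s\wedge t)\,v(s\vee t)$, with $u(s)=s$, $v(t)=1$ in the Brownian case and $u(s)=\frac{\sigma^2}{2\theta}e^{\theta s}$, $v(t)=e^{-\theta t}$ in the Ornstein-Uhlenbeck case:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 50                          # number of trajectories, grid size
t = np.linspace(0.02, 1.0, m)            # grid bounded away from 0 so that u and v never vanish

# Brownian motion: cumulative sums of independent Gaussian increments
dt = np.diff(np.concatenate(([0.0], t)))
bm = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n, m)), axis=1)

# Stationary Ornstein-Uhlenbeck dX = -theta X dt + sigma dW, simulated exactly (AR(1) recursion)
theta, sigma = 1.0, 1.0
ou = np.empty((n, m))
ou[:, 0] = rng.normal(0.0, sigma / np.sqrt(2 * theta), n)   # stationary initial distribution
for j in range(1, m):
    a = np.exp(-theta * (t[j] - t[j - 1]))
    sd = sigma * np.sqrt((1.0 - a**2) / (2 * theta))
    ou[:, j] = a * ou[:, j - 1] + rng.normal(0.0, sd, n)

def triangular_cov(u_vals, v_vals):
    """Matrix K(t_i, t_j) = u(min(t_i, t_j)) * v(max(t_i, t_j)) on the grid."""
    idx = np.arange(len(u_vals))
    i, j = np.meshgrid(idx, idx, indexing="ij")
    return u_vals[np.minimum(i, j)] * v_vals[np.maximum(i, j)]

K_bm = triangular_cov(t, np.ones_like(t))                                      # u(s) = s, v(t) = 1
K_ou = triangular_cov(sigma**2 / (2 * theta) * np.exp(theta * t), np.exp(-theta * t))
print(np.abs(np.cov(bm, rowvar=False) - K_bm).max())   # both discrepancies shrink as n grows
print(np.abs(np.cov(ou, rowvar=False) - K_ou).max())
```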

The following theorem is due to Varberg (1961, Th. 1) and Jørsboe (1968, p. 61). It shows that the Radon-Nikodym derivative $dP_1/dP_0$ can be expressed in a closed, relatively simple way for these special classes of Gaussian processes. For more information concerning explicit expressions of Radon-Nikodym derivatives for Gaussian processes see Segall and Kailath (1975) and references therein. From now on, $X^{(0)}$ and $X^{(1)}$ denote the processes generating the populations $P_0$ and $P_1$, respectively.

Theorem 2

Assume that $X^{(i)}=\{X^{(i)}(t),\ t\in[0,1]\}$, for $i=0,1$, are Gaussian processes on $[0,1]$, with covariance functions $K_i(s,t)=u_i(s\wedge t)\,v_i(s\vee t)$, for $i=0,1$, where $u_i$, $v_i$, for $i=0,1$, are positive functions in $\mathcal{C}^2[0,1]$. Assume also that $u_i$, $v_i$, for $i=0,1$, satisfy the regularity conditions required in Varberg (1961, Th. 1) and Jørsboe (1968, p. 61); in particular, certain functions of $u_i$ and $v_i$ must be bounded away from zero on $[0,1]$.

  1. Assume that $m_i=0$, for $i=0,1$. Then there exist some constants and a function, whose expressions are given in the proof, such that

    (8)
  2. Assume now that the covariance functions are identical, i.e. $u_0=u_1=u$ and $v_0=v_1=v$ on $[0,1]$, that $m_0=0$, and that $m_1$ is a function in $\mathcal{C}^2[0,1]$ such that $m_1(t)=0$ whenever $u(t)=0$. Then there exist some constants and a function, whose expressions are given in the proof, such that

    (9)

Proof:

  1. Varberg (1961, Th. 1) shows that, under the assumptions of (a), $P_0$ and $P_1$ are equivalent measures. The Radon-Nikodym derivative of $P_1$ with respect to $P_0$ is

    (10)

    where

    and .

    Observe that, by the assumptions of the theorem, the function appearing in the stochastic integral (10) is differentiable with bounded derivative. Thus it is of bounded variation and may be expressed as the difference of two bounded positive increasing functions. Therefore the stochastic integral (10) is well defined and can be evaluated by integrating by parts, leading to conclusion (8).

  2. In Jørsboe (1968), p. 61, it is proved that, under the indicated assumptions, $P_0$ and $P_1$ are equivalent measures with Radon-Nikodym derivative

    with

    and . Again, the integration by parts gives (9), where .

In the general case where $m_0\neq m_1$ and $K_0\neq K_1$, let us denote by $P_{m,K}$ the distribution of the Gaussian process with mean $m$ and covariance function $K$. Then, applying the chain rule for Radon-Nikodym derivatives (see, e.g., Folland, 1999) we get

$\frac{dP_1}{dP_0} = \frac{dP_{m_1,K_1}}{dP_{m_0,K_1}}\cdot\frac{dP_{m_0,K_1}}{dP_{m_0,K_0}}.$   (11)

Under the appropriate assumptions, the expressions of the Radon-Nikodym derivatives on the right-hand side of (11) are given in (8) and (9).

2.3 Parametric plug-in rules

The aim of this subsection is twofold. First and foremost, we show how the theoretical results of Subsections 2.1 and 2.2 become useful in practice. To this end, we consider examples of well-known Gaussian processes that fulfill the requirements of Theorems 1 and 2, namely Brownian motions with drift and Ornstein-Uhlenbeck processes. We derive the expressions of the Radon-Nikodym derivatives for these examples. Then it is straightforward to compute the Bayes rule for classification between two elements of one of these families. In these particular examples the mean and variance of the Gaussian process have known parametric expressions (up to a finite number of parameters). Thus the Bayes rule is completely specified as long as the parameters have known values. When this is not the case, we can substitute each unknown parameter in the rule by some estimate. The resulting discrimination procedure is called the parametric plug-in rule. In particular, for the Bayes rules given in (12), (13), (14) and (15) below the explicit expressions of the parameter estimates are given in the appendix.

The second objective of Subsection 2.3 is to obtain the expressions of the Bayes rules for the models used in Section 4 and to derive the corresponding parametric plug-in versions.

Two Brownian motions

Let us denote $X^{(i)}=\{X^{(i)}(t),\ t\in[0,1]\}$, for $i=0,1$. In the Brownian case, using the standard notation of stochastic differential equations, $X^{(i)}$ is just the solution of $dX^{(i)}(t)=m_i'(t)\,dt+\sigma_i\,dB_i(t)$, for $i=0,1$ and $t\in[0,1]$. Here $m_i$, $i=0,1$, is a smooth mean function, $\sigma_i>0$ is a constant, $B_0$ and $B_1$ are two uncorrelated Brownian motions, and $X^{(i)}(0)$ is the initial value. Then, if $\sigma_0=\sigma_1=\sigma$, the conditions of Theorem 2 are satisfied with $u_i(t)=\sigma^2 t$ and $v_i(t)=1$, for $i=0,1$.

When , we have and, for any ,

Thus the Bayes rule is

(12)

If the initial value $X^{(i)}(0)$ is random, for $i=0,1$, then a similar calculation yields that the Bayes rule classifies $x$ in population $P_1$ whenever

(13)

Replacing the unknown parameters appearing in (12) and (13) by estimates, we obtain the corresponding parametric plug-in rules.
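The exact parameter estimates used in the paper are given in the appendix. Purely as an illustration of the plug-in idea (the estimators below are ours, not the authors’), the mean function of a Brownian-type model $X(t)=m(t)+\sigma B(t)$ can be estimated by the pointwise sample mean and the diffusion constant by the realized quadratic variation of the discretized trajectories:

```python
import numpy as np

def estimate_brownian_with_drift(X, t):
    """Rough estimates for X(t) = m(t) + sigma * B(t), observed as an
    (n_curves, n_grid) array X on the common grid t."""
    m_hat = X.mean(axis=0)                       # pointwise sample mean, estimates m(t)
    increments = np.diff(X - m_hat, axis=1)      # centred increments of each trajectory
    # realized quadratic variation: sum of squared increments estimates sigma^2 * (t[-1] - t[0])
    sigma2_hat = (increments**2).sum(axis=1).mean() / (t[-1] - t[0])
    return m_hat, np.sqrt(sigma2_hat)
```

Estimating the parameters of each group separately from the training curves with $Y_i=0$ and $Y_i=1$ and inserting the results in (12) or (13) yields the parametric plug-in classifier.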

When $\sigma_0\neq\sigma_1$, then $u_0(t)=\sigma_0^2 t\neq\sigma_1^2 t=u_1(t)$, for $t>0$, and the hypothesis in Theorem 2 is not satisfied. In fact, if this last equality does not hold, by Theorem 1 in Varberg (1961) we know that $P_0$ and $P_1$ are mutually singular.

Two Ornstein-Uhlenbeck processes

Let $X^{(i)}$, for $i=0,1$, be Ornstein-Uhlenbeck processes given by

where $B_0$ and $B_1$ are two independent Brownian motions and $\theta_i$, $\sigma_i$, $i=0,1$, are constants.

If is equal to a constant , we have that and . Fixing , we get and for . The condition in Theorem 2 is fulfilled if and only if . Also, since , then has to be 0 for . Then it is straightforward to check that the Bayes rule classifies in population if

(14)

When is random, it follows a normal distribution with mean and variance . Then , for all , and , and . Consequently, the Bayes rule assigns to population if

(15)

The parametric plug-in classification rule is derived by substituting the unknown parameters appearing in (14) and (15) with their corresponding estimators.
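Again, the estimators actually used in the paper are detailed in the appendix. For illustration only (this is an assumption of ours, exploiting the fact that an exactly discretized Ornstein-Uhlenbeck process follows an AR(1) recursion), the parameters of a zero-mean model $dX(t)=-\theta X(t)\,dt+\sigma\,dB(t)$ could be recovered by regressing each observation on its predecessor:

```python
import numpy as np

def estimate_ou(X, t):
    """Rough Ornstein-Uhlenbeck estimates (theta, sigma) from an (n_curves, n_grid)
    array X observed on an equally spaced grid t, for dX = -theta X dt + sigma dB."""
    dt = t[1] - t[0]
    x_prev, x_next = X[:, :-1].ravel(), X[:, 1:].ravel()
    # exact discretization: X_{k+1} = a X_k + eps_k, with a = exp(-theta dt)
    a_hat = (x_prev @ x_next) / (x_prev @ x_prev)         # least-squares slope through the origin
    theta_hat = -np.log(a_hat) / dt
    resid_var = np.mean((x_next - a_hat * x_prev) ** 2)   # Var(eps) = sigma^2 (1 - a^2) / (2 theta)
    sigma_hat = np.sqrt(2 * theta_hat * resid_var / (1 - a_hat**2))
    return theta_hat, sigma_hat
```

As before, fitting each group separately and plugging the estimates into (14) or (15) gives the parametric plug-in rule.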

2.4 Nonparametric plug-in rules

In this section we analyze the situation in which the processes ultimately belong to the Gaussian family fulfilling the conditions of Theorem 2, but we do not place any parametric assumption on the mean and the covariance functions. However, let us note that, until we get to the estimation of the Radon-Nikodym derivatives, the Gaussianity assumption is not needed. Specifically, we only assume that the covariance functions of the involved processes are of type $K(s,t)=u(s\wedge t)\,v(s\vee t)$, for some (unknown) real functions $u$ and $v$, where $v$ is bounded away from 0 on the interval $[0,1]$.

Observe that, in order to use a plug-in version of the optimal classification rule along the lines of Theorems 1 and 2, we need to estimate the functions $m$, $u$ and $v$, as well as their first and second derivatives. Since these estimation problems have some independent interest, in this subsection we consider them in a general setup, not necessarily linked to the classification problem. Thus we use the ordinary iid sampling model with a fixed sample size denoted, for simplicity, by $n$ in all cases.

Regarding $u$ and $v$, let us note that the triangular condition entails $u(t)v(t)=K(t,t)$ and $u(s)v(t)=K(s,t)$ if $s\le t$. However, it is clear that these conditions only determine $u$ and $v$ up to multiplicative constants, so that one can impose (without loss of generality) the additional assumption $v(1)=1$. Thus, it turns out that $u$ and $v$ can be uniquely determined in terms of $K(t,1)$ and $K(t,t)$. Our study will require three steps: first, the estimation of the mean function and its derivatives, then the analogous study for $u$, $v$ and $K(t,t)$ and, finally, the analysis of more involved functions defined in terms of these.

In Propositions 1 to 3 below we assume that the sample data are $X_1,\ldots,X_n$, iid trajectories of a process $X$ in the space $\mathcal{C}[0,1]$, endowed with the supremum norm $\|\cdot\|_\infty$.

Estimation of the mean and covariance functions and their derivatives

To estimate the mean function $m$ and its derivatives, we will only need to assume that $X$ satisfies a condition which (see p. 172 in Araujo and Giné, 1980) implies that the distribution of $X$ satisfies the Central Limit Theorem (CLT) in $\mathcal{C}[0,1]$.

The natural estimator of $m$ is the sample mean, denoted by $\bar{X}$. Since the derivatives of $m$ are also involved in the expressions of the Radon-Nikodym derivatives obtained in Theorem 2, we will also need to consider the estimation of $m'$ and $m''$. Our estimators will depend on a given sequence $(h_n)$ of smoothing parameters. Given $h=h_n$, define

For , we define

where . The definition of and on is similar. These definitions allow us to handle the extreme points and the inner ones analogously. Thus we will not pay special attention to the extreme points in the proofs.

There is a slight notational abuse in these definitions as, for example, $\bar{X}'$ is not the derivative of $\bar{X}$ but an estimator of $m'$. We keep this notation throughout the manuscript for simplicity.

As mentioned at the beginning of this section, due to the triangular structure of $K$, in principle we should only concentrate on the estimation of the functions $u$ and $v$ and their derivatives. However, for technical reasons we will also need to consider the function $K(t,t)$ and its derivatives. Natural nonparametric estimators of these functions can be given in terms of the empirical covariance

The estimation of the required derivatives is carried out in a way analogous to what we did with the mean function. Observe finally that, since $u(t)=K(t,1)$ (recall the normalization $v(1)=1$), we can estimate $u(t)$ by $\hat{K}_n(t,1)$ for any $t$, and similarly for its first two derivatives. Regarding the function $v$, we estimate $v(t)$ by $\hat{K}_n(t,t)/\hat{K}_n(t,1)$.
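A compact sketch of estimators of this kind follows (our illustration; the normalization $v(1)=1$ and the grid-based difference quotients are assumptions made here for concreteness, whereas the paper uses difference quotients driven by a smoothing sequence $h_n$):

```python
import numpy as np

def nonparametric_estimates(X, t):
    """Sketch of nonparametric estimates of m, m', u and v from an (n_curves, n_grid)
    array X of trajectories observed on the grid t, under the triangular model
    K(s,t) = u(min(s,t)) v(max(s,t)) normalized so that v(1) = 1."""
    m_hat = X.mean(axis=0)                   # sample mean curve, estimates m
    m_prime_hat = np.gradient(m_hat, t)      # finite-difference estimate of m'
    K_hat = np.cov(X, rowvar=False)          # empirical covariance matrix K_n(t_i, t_j)
    u_hat = K_hat[:, -1]                     # u(t) estimated by K_n(t, 1); last grid point taken as t = 1
    v_hat = np.diag(K_hat) / u_hat           # v(t) estimated by K_n(t, t) / K_n(t, 1)
    return m_hat, m_prime_hat, u_hat, v_hat
```

Plugging these estimates (and the analogous ones for the second group of curves) into the expressions of Theorem 2 yields the nonparametric plug-in classifier studied in Section 4.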

Proposition 1

Let $X_1,\ldots,X_n$ be iid trajectories in $\mathcal{C}[0,1]$ of a process $X$ satisfying the CLT condition above and whose mean function $m$ has a Lipschitz second derivative.

  1. For the mean estimation problem we have,

    (16)
    (17)
    (18)
  2. Assume that $\mathbb{E}\|X\|_\infty^4<\infty$ and that the functions $u$, $v$ and $K(t,t)$ admit Lipschitz second-order derivatives. Then, we have

    (19)
    (20)
    (21)

    Similar results also hold for and .

From the proof of this proposition (see the Appendix) it can be checked that the assumption can be replaced with , for some , and for any .

Estimation of

The estimation of is harder than that of . It will be useful to distinguish two cases, where the estimators must be defined in different ways. In the case (corresponding to the case ) we have which is estimated by

(22)

When (which implies that ), the estimator proposed in (22) is, at best, highly unstable. This case is not unusual: see, for instance, the examples introduced in Subsection 2.3 when is constant. For the sake of simplicity from now on assume that for .

The first step is to define for , where is a sequence of positive numbers converging to zero (whose rate will be determined later). Then we define estimates for the first and the second derivatives of on the same interval. The structure of as a quotient suggests defining, on ,

where ,

Now we complete the definition of our estimator of on the whole interval by using a Taylor-type expansion on ,

(23)

Finally, take

Proposition 2

Let the assumptions of Proposition 1 (b) hold.

  1. If then the rates of convergence of , and are the same as those of (19), (20) and (21), respectively.

  2. If assume that and for every . Let be such that . Then