Fast DD-classification of functional data
A fast nonparametric procedure for classifying functional data is introduced. It consists of a two-step transformation of the original data plus a classifier operating on a low-dimensional space. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the DD-plot, which is a subset of the unit square. This transformation yields a new notion of depth for functional data. Three alternative depth functions are employed for this, as well as two rules for the final classification on the DD-plot. The resulting classifier has to be cross-validated over a small range of parameters only, which is restricted by a Vapnik-Chervonenkis bound. The entire methodology does not involve smoothing techniques, is completely nonparametric and allows Bayes optimality to be achieved under standard distributional settings. It is robust, efficiently computable, and has been implemented in an R environment. The applicability of the new approach is demonstrated by simulations as well as by a benchmark study.
Keywords: Functional depth; Supervised learning; Central regions; Location-slope depth; DD-plot; Alpha-procedure; Berkeley growth data; Medflies data.
The problem of classifying objects that are represented by functional data arises in many fields of application, like biology, biomechanics, medicine and economics. Ramsay & Silverman (2005) and Ferraty & Vieu (2006) contain broad overviews of functional data analysis and the evolving field of classification. Since the beginning of the 21st century many classification approaches have been extended from multivariate to functional data: linear discriminant analysis (James & Hastie, 2001), kernel-based classification (Ferraty & Vieu, 2003), the k-nearest-neighbours classifier (Biau et al., 2005), logistic regression (Leng & Müller, 2006), neural networks (Ferré & Villa, 2006), and support vector machines (Rossi & Villa, 2006). Transformation of functional data into a finite setting is done by using principal and independent (Huang & Zheng, 2006) component analysis, principal coordinates (Hall et al., 2001), wavelets (Wang et al., 2007), functions of very simple and interpretable structure (Tian & James, 2013), or some optimal subset of initially given evaluations (Ferraty et al., 2010; Delaigle et al., 2012).
Generally, functional data are projected onto a finite-dimensional space in one of two ways: by either fitting some finite basis or using the functional values at a set of discretization points. The first approach accounts for the entire functional support, and the basis components can often be well interpreted. However, the chosen basis is not per se best for classification purposes; e.g., principal component analysis (PCA) maximizes dispersion but does not minimize classification error. Moreover, higher-order properties of the functions, which are regularly not incorporated, may carry information that is important for classification (see Delaigle & Hall, 2012, for a discussion). The second approach appears natural, as the finite-dimensional space is directly constructed from the observed values. But any selection of discretization points restricts the range of values regarded, so that some classification-relevant information may be lost. Also, the data may be given at arguments of the functions that are neither common nor equidistant nor sufficiently frequent. Then some interpolation is needed, and interpolated data instead of the original ones are analyzed. Another issue is the way the space is synthesized. If it is a heuristic (as in Ferraty et al., 2010), a well-classifying configuration of discretization points may be missed. (To see this, consider three discretization points which jointly discriminate well in the space constructed from their evaluations, but which cannot be chosen sequentially because each of them has relatively small discrimination power compared to some other available discretization points.) To cope with this problem, Delaigle et al. (2012) consider (almost) all sets of discretization points of a given cardinality; but this procedure involves an enormous computational burden, which restricts its practical application to rather small data sets.
López-Pintado & Romo (2006), Cuevas et al. (2007), Cuesta-Albertos & Nieto-Reyes (2010) and Sguera et al. (2014) have introduced nonparametric notions of data depth for functional data classification (see also Cuesta-Albertos et al., 2014). A data depth measures how close a given object is to an (implicitly given) center of a class of objects; that is, if the objects are functions, how central a given function is in an empirical distribution of functions.
Specifically, the band depth (López-Pintado & Romo, 2006) of a function x in a class X of functions indicates the relative frequency of x lying in a band shaped by j functions from X, where j is fixed. Cuevas et al. (2007) examine five functional depths for tasks of robust estimation and supervised classification: the integrated depth of Fraiman & Muniz (2001), which averages univariate depth values over the function's domain; the h-mode depth, employing a kernel; the random projection depth, taking the average of univariate depths in random directions; and the double random projection depths, which include first derivatives and are based on bivariate depths in random directions. Cuesta-Albertos & Nieto-Reyes (2010) classify the Berkeley growth data (Tuddenham & Snyder, 1954) by use of the random Tukey depth (Cuesta-Albertos & Nieto-Reyes, 2008). Sguera et al. (2014) introduce a functional spatial depth and a kernelized version of it. Cuesta-Albertos et al. (2014) extensively study applications of the DD-plot to functional classification, employing a variety of functional depths.
There are several problems connected with the depths mentioned above. First, except for the double random projection depth, the functions are treated as multivariate data of infinite dimension. By this, the development of the functions, say in time, is not exploited: these depth notions are invariant with respect to an arbitrary rearrangement of the function values. Second, several of these notions of functional depth break down in standard distributional settings, i.e. the depth functions vanish almost everywhere (see Chakraborty & Chaudhuri, 2014; Kuelbs & Zinn, 2013). Third, the depth takes empirical zero values if the function's hyper-graph has no intersection with the hypo-graph of any of the sample functions or vice versa, which is the case for both half-graph and band depths and their modified versions, as well as for the integrated depth. If a function has zero depth with respect to each class, it is called an outsider, because it cannot be classified immediately and requires additional treatment (see Lange et al., 2014a; Mozharovskyi et al., 2015).
1.1 Two benchmark problems
Naturally, the performance of a classifier and its relative advantage over alternative classifiers depend on the actual problem to be solved. Therefore we start with two benchmark data settings, one fairly simple and the other one rather involved.
First, a popular benchmark set is the growth data of the Berkeley Growth Study (Tuddenham & Snyder, 1954). It comprises the heights of 54 girls and 39 boys measured at 31 non-equally distant time points; see Figure 1, left. These data were extensively studied in Ramsay & Silverman (2005). Having become a well-behaved classic, they have been considered as a benchmark in many works on supervised and unsupervised classification, by some authors also in a data depth context. Based on the generalized band depth, López-Pintado & Romo (2006) introduce a trimming and weighting of the data and classify them by their (trimmed) weighted average distance to an observation or their distance to the trimmed mean. Cuevas et al. (2007) classify the growth data by their maximum depth, using the five above-mentioned depths and kNN as a reference (see also Cuesta-Albertos & Nieto-Reyes, 2010). Recently, Sguera et al. (2014) apply the functional spatial depth to this data set, and Cuesta-Albertos et al. (2014) perform DD-plot classification.
Second, we consider a much more difficult task: classifying the medflies data of Carey et al. (1998), where 1000 30-day (starting from the fifth day) egg-laying patterns of Mediterranean fruit fly females are observed. The classification task is to explain longevity by productivity. For this, a subset of 534 flies living at least 34 days is separated into two classes: 278 living 10 days or longer after the 34th day (long-lived), and 256 that died within these 10 days (short-lived); the classes are to be distinguished using the daily egg-laying sequences (see Figure 1, right, with the linearly interpolated evaluations). This task is taken from Müller & Stadtmüller (2005), who demonstrate that the problem is quite challenging and cannot be satisfactorily solved by means of functional linear models.
1.2 The new approach
We shall introduce a new methodology for supervised functional classification that addresses the mentioned issues, and validate it with the considered real data sets as well as with simulated data. Our approach involves a new notion of functional data depth. It is completely non-parametric, designed to work with raw and irregularly sampled functional data, and does not involve heavy computations. It extends the idea of the location-slope depth (see Mosler & Polyakova, 2012), and employs the depth-to-depth classification technique proposed by Li et al. (2012), the DDα-classification of Lange et al. (2014a) and the depth-based kNN, first suggested by Vencálek (2011).
Clearly, as any statistical methodology for functional data, our classification procedure has to map the relevant features of the data to some finite-dimensional setting. For this we map the functional data to a finite-dimensional location-slope space, where each function is represented by a vector consisting of integrals of its levels ('location') and first derivatives ('slope') over p respectively q equally sized subintervals.
We interpolate the functions linearly; hence their levels are integrated as piecewise-linear functions, and the derivatives as piecewise-constant ones. Then we classify the data within the location-slope space using a proper depth-based technique. We restrict the dimension d = p + q by a Vapnik-Chervonenkis bound and determine p and q by cross-validation of the final classification error, which is done very quickly. Thus, the loss of information caused by the finite-dimensional representation of the data, which cannot be avoided, is measured by the final error of our classification procedure and minimized. This contrasts with standard approaches using PCA projections or projections on spline bases, which penalize the loss of information with respect to a functional subspace but not with respect to the ability of discrimination.
To construct the classification rule, the training classes in the location-slope space are transformed to a depth-depth plot (DD-plot, see Li et al., 2012), which is a two-dimensional set of points, each indicating the depth of a data point regarding the two training classes. Finally, the classes are separated on the DD-plot by either applying a k-nearest-neighbor (kNN) rule (Vencálek, 2011) or the α-procedure (Lange et al., 2014a).
The resulting functional data depth, called the integral location-slope depth, has a number of advantages. It is based on a linear mapping, which does not introduce any spurious information and can preserve Bayes optimality under standard distributional assumptions. Already in the first step, the construction of the finite-dimensional space aims at the minimization of an overall goal function, viz. the misclassification rate. The simplicity of the mapping allows for its efficient computation.
The procedure has been implemented in the R-package ddalpha.
The rest of the paper is organized as follows: Section 2 reviews the use of data depth techniques in classifying finite-dimensional objects. Section 3 presents the new two-step representation of functional data, first in a finite-dimensional Euclidean space (the location-slope space), and then in a depth-to-depth plot (DD-plot). In Section 4 we briefly introduce two alternative classifiers that operate on the DD-plot, a k-nearest-neighbor procedure and the α-procedure. Section 5 treats some theoretical properties of the developed classifiers. Section 6 provides an approach to bound and select the dimension of the location-slope space. In Section 7 our procedure is applied to simulated data and compared with other classifiers. Also, the above two real data problems are solved and the computational efficiency of our approach is discussed. Section 8 concludes. Implementation details and additional experimental results are collected in the Appendix.
2 Depth-based approaches to classification in R^d
For data in the Euclidean space R^d many special depth notions have been proposed in the literature; see, e.g., Zuo & Serfling (2000) for definitions and properties. Here we mention three depths: the Mahalanobis, spatial and projection depths. These depths are everywhere positive. Hence they do not produce outsiders, that is, points having zero depth in both training classes. This no-outsider property appears to be essential in obtaining functional depths for classification.
For a point x and a random vector X having an empirical distribution on a sample in R^d, the Mahalanobis depth (Mahalanobis, 1936) of x w.r.t. X is defined as

D_Mah(x | X) = (1 + (x − μ_X)ᵀ Σ_X⁻¹ (x − μ_X))⁻¹,

where μ_X measures the location of X, and Σ_X the scatter. The spatial depth of x w.r.t. X is given by

D_Spt(x | X) = 1 − ‖ E[ v( Σ_X^(−1/2) (x − X) ) ] ‖,

where ‖·‖ denotes the Euclidean norm, v(y) = y / ‖y‖ for y ≠ 0 and v(0) = 0, and Σ_X is the covariance matrix of X. D_Mah and D_Spt can be efficiently computed in R^d.
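As an illustration, the two depths above can be computed in a few lines; this is a minimal sketch with our own function names (not the ddalpha implementation):

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Mahalanobis depth of x w.r.t. the sample X (rows = observations)."""
    mu = X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = (x - mu) @ Sinv @ (x - mu)
    return 1.0 / (1.0 + d2)

def spatial_depth(x, X):
    """Affine-invariant spatial (L1) depth of x w.r.t. the sample X."""
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    Sinv_half = V @ np.diag(w ** -0.5) @ V.T   # inverse square root of the covariance
    diffs = (x - X) @ Sinv_half
    norms = np.linalg.norm(diffs, axis=1)
    units = diffs[norms > 0] / norms[norms > 0, None]
    # one minus the norm of the average unit vector pointing from the sample to x
    return 1.0 - np.linalg.norm(units.mean(axis=0))
```

Both values lie in (0, 1] and are maximal near the center of the sample, so no point receives zero depth.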
The projection depth (Zuo & Serfling, 2000) of x regarding X is given by

D_Prj(x | X) = (1 + sup_{‖u‖=1} |uᵀx − med(uᵀX)| / MAD(uᵀX))⁻¹,

where med denotes the univariate median and MAD the median absolute deviation from the median. Exact computation of D_Prj is, in principle, possible (Liu & Zuo, 2014) but practically infeasible unless d is very small. Obviously, D_Prj is approximated from above by calculating the minimum of univariate projection depths over random directions. However, as the objective is piecewise linear in the direction (and, hence, attains its maximum on the edges of the direction cones of constant linearity), a randomly chosen direction yields the exact depth value with probability zero. For a sufficient approximation one needs a huge number of directions, each of which involves the calculation of the median and MAD of a univariate distribution.
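The random-direction approximation just described can be sketched as follows (an illustrative helper of our own; as noted, the result only bounds the true depth from above):

```python
import numpy as np

def projection_depth(x, X, n_dir=1000, rng=None):
    """Approximate projection depth of x w.r.t. sample X by maximizing the
    univariate outlyingness over random directions; over finitely many
    directions the supremum is underestimated, so the depth is
    approximated from above."""
    rng = np.random.default_rng(rng)
    U = rng.normal(size=(n_dir, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions
    proj = X @ U.T                                  # univariate projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    out = np.abs(x @ U.T - med) / mad               # outlyingness per direction
    return 1.0 / (1.0 + out.max())
```

Each direction costs one median and one MAD computation over the projected sample, which is why a precise approximation in higher dimensions becomes expensive.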
To classify objects in the Euclidean space R^d, the existing literature employs depth functions in principally two ways:
1. Classify the original data by their maximum depth in the training classes.
2. Transform the data by their depths into a low-dimensional depth space, and classify them within this space.
Ad 1: Ghosh & Chaudhuri (2005) propose the maximum depth classifier, which assigns an object to the class in which it has maximum depth. In its naive form this classifier yields a linear separation. Maximum-depth classifiers have a plug-in structure; their scale parameters need to be tuned (usually by some kind of cross-validation) over the whole learning process. For this, Ghosh & Chaudhuri (2005) combine the naive maximum-depth classifier with an estimate of the density. A similar approach is pursued with the projection depth in Dutta & Ghosh (2012a) and the L_p-depth in Dutta & Ghosh (2012b), yielding competitive classifiers.
Ad 2: Depth notions are also used to reduce the dimension of the data. Li et al. (2012) employ the DD-plot, which represents all objects by their depth in the two training classes, that is, by points in the unit square. (With more than two training classes, the objects are analogously represented in a unit cube of corresponding dimension.) To solve the classification task, some separation rule has to be constructed in the unit square. Li et al. (2012) minimize the empirical risk, that is, the average classification error on the training classes, by smoothing it with a logistic sigmoid function and, by this, obtain a polynomial separating rule. They show that their approach (with Mahalanobis, projection and other depths) asymptotically achieves the optimal Bayes risk if the training classes are strictly unimodal elliptically distributed. However, in practice the choice of the smoothing constant and the non-convex optimization, potentially with many local minima, encumber its application. In Lange et al. (2014a) these problems are addressed via the α-procedure, which is very fast and speeds up the learning phase enormously. The use of other procedures on the DD-plot is discussed in Cuesta-Albertos et al. (2014).
3 A new depth transform for functional data
Let F be the space of real functions, defined on a compact interval, which are continuous and smooth at all but a finite set of points, and let F be endowed with the supremum norm. The data may be given either as observations of complete functions in F or as functional values at some discretization points, in general neither equidistant nor common to all functions. If the functional data are given in discretized form, they are usually interpolated by splines of some order (see Ramsay & Silverman, 2005), so that sufficiently smooth functions are obtained. Here we use linear interpolation (that is, splines of order 1) for the following reasons. Firstly, with linear interpolation, neither the functions nor their derivatives need to be smoothed; thus almost any raw data can be handled easily. E.g., the medflies data, as to the egg-laying process, are naturally discrete; see Figure 1 (right). Secondly, higher-order splines increase the computational load, especially when the number of knots or the smoothing parameter is to be determined as part of the task. Thirdly, splines of higher order may introduce spurious information.
We construct a depth transform as follows. In a first step, the relevant features of the functional data are extracted from F into a finite-dimensional Euclidean space R^d, which we call the location-slope space (Section 3.1). Then a d-dimensional depth is applied to the transformed data, yielding a DD-plot in the unit square (Section 3.2), which represents the two training classes. Finally, the separation of the training classes as well as the classification of new data is done on the DD-plot (Section 4).
3.1 The location-slope transform
We consider two classes of functions in F, each function x being given as measurements x(t_0), x(t_1), …, x(t_k) at ordered points t_0 < t_1 < … < t_k, which may differ between functions. Assume w.l.o.g. that all arguments lie in [0, 1]. From the measurements a function x is obtained as follows. Connect the points (t_j, x(t_j)) with line segments and set x(t) = x(t_0) when t < t_0 and x(t) = x(t_k) when t > t_k. By this, the data become piecewise-linear functions on [0, 1], and their first derivatives become piecewise-constant ones; see Figure 2, left.
A finite-dimensional representation of the data is then constructed by integrating the interpolated functions over p subintervals (location) and their derivatives over q subintervals (slope), see Figure 2 (left). Thus, our location-slope space has dimension d = p + q. It delivers the following transform,

x ↦ (ℓ_1, …, ℓ_p, s_1, …, s_q),  where  ℓ_j = p ∫_{(j−1)/p}^{j/p} x(t) dt  and  s_j = q ∫_{(j−1)/q}^{j/q} x′(t) dt.   (3.1)

That is, the average values and slopes constitute a point in R^d. Either p or q must be positive. In case p = 0 or q = 0, formula (3.1) is properly modified. Put together, we obtain a composite transform T: F → R^d.
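The location-slope transform of a discretely observed function can be sketched as follows (the helper names are ours). The constant extrapolation outside the observed range matches the interpolation rule above, and the average slope over a subinterval is computed exactly as a difference quotient of the interpolant, since the integral of the derivative telescopes:

```python
import numpy as np

def _avg_level(t, x, a, b):
    # exact average of the piecewise-linear interpolant of (t, x) on [a, b]
    knots = np.concatenate(([a], t[(t > a) & (t < b)], [b]))
    vals = np.interp(knots, t, x)          # constant extrapolation outside [t_0, t_k]
    widths = np.diff(knots)
    return ((vals[1:] + vals[:-1]) / 2 * widths).sum() / (b - a)

def location_slope(t, x, p, q):
    """Map a function, observed as values x at ordered points t in [0, 1],
    to its (p + q)-dimensional location-slope representation."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    loc = [_avg_level(t, x, j / p, (j + 1) / p) for j in range(p)]
    slo = []
    if q > 0:
        # the average slope over [a, b] equals (x(b) - x(a)) / (b - a)
        edges = np.linspace(0.0, 1.0, q + 1)
        ex = np.interp(edges, t, x)
        slo = list(np.diff(ex) / np.diff(edges))
    return np.array(loc + slo)
```

For the linear function x(t) = t observed at t = 0 and t = 1, `location_slope([0, 1], [0, 1], 2, 2)` yields the two level averages 0.25 and 0.75 followed by the two slope averages 1 and 1.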
For example, choose p = 0 and q = 2 for the growth data. Then they are mapped into the location-slope space R^2, which is shown in Figure 2 (right). Here the functions' first derivatives are averaged over two half-intervals. That is, for each function two integrals of the slope are obtained: the integral over [0, 1/2] and that over [1/2, 1]. Here, the location is not incorporated at all. Figure 2, left, exhibits the value (height) and first derivative (velocity) of a single interpolated function, which is then represented by the average slopes on the two half-intervals, yielding the rectangular point in Figure 2 (right).
Further, given the data, p and q can be chosen large enough to reflect all relevant information about the functions. (Note that the computational load of the whole remaining procedure is linear in the dimension d.) If p and q are properly chosen, under certain standard distributional assumptions the location-slope transform preserves asymptotic Bayes optimality; see Section 5 below.
Naturally, only part of these intervals carries the information needed for separating the two classes. Delaigle et al. (2012) propose to determine a subset of discretization points based on which the training classes are optimally separated. However, they do not provide a practical procedure to select these points; in applications they use cross-validation. Generally, we have no prior reason to weight the intervals differently. Therefore we use intervals of equal length, but possibly different ones for location and slope.
The question remains how many equally sized subintervals should be taken, p for location and q for slope. We will see later in Section 7 that our classifier performs similarly with the three depths when the dimension is low. In higher dimensions the projection depth cannot be computed precisely enough, so that the classifier becomes worse. The performance is most influenced by the construction of the location-slope transform, that is, by the choice of the numbers p and q. We postpone this question to Section 6.
3.2 The DD-transform
Denote the location-slope-transformed training classes in R^d by Y_1 and Y_2. The DD-plot is then

DD = { (D(y | Y_1), D(y | Y_2)) : y ∈ Y_1 ∪ Y_2 } ⊂ [0, 1]².

Here D is a d-dimensional depth. In particular, D may be the Mahalanobis, spatial or projection depth, each of which does not produce outsiders. The DD-plots of these three depths for the growth data, taking p = 0 and q = 2, are pictured in Figure 3. Clearly, in this case, almost faultless separation is achievable by drawing a straight line through the origin.
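A DD-plot of this kind can be produced, e.g., with the Mahalanobis depth; a minimal sketch with our own helper names:

```python
import numpy as np

def mdepth(x, X):
    # Mahalanobis depth: everywhere positive, hence no outsiders
    mu = X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = (x - mu) @ Sinv @ (x - mu)
    return 1.0 / (1.0 + d2)

def dd_plot(Y1, Y2):
    """Represent every training point by its pair of depths w.r.t. the
    two location-slope-transformed classes, i.e. a point in (0, 1]^2."""
    pts = np.vstack([Y1, Y2])
    return np.array([[mdepth(y, Y1), mdepth(y, Y2)] for y in pts])
```

The resulting two-dimensional point cloud is what the classifiers of Section 4 operate on.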
4 DD-plot classification
The training phase of our procedure consists in the following. After the training classes have been mapped from F to the DD-plot as described in Section 3, a separator is determined in the DD-plot that splits the transformed data. For the latter, we consider two classifiers operating on the DD-plot and compare their performance. Firstly, we propose the kNN classifier; it converges to the Bayes rule under standard elliptical models, see Section 4.1. Note that k needs to be cross-validated on the DD-plot, which is computationally expensive. Therefore, secondly, we employ the α-procedure, which is a very fast heuristic; see Section 4.2. Although its theoretical convergence is not guaranteed (see Lange et al., 2014a), the DDα-classifier performs very well in applications.
4.1 kNN-classification on the DD-plot
When applied to multivariate data, kNN is known (under mild assumptions) to be a consistent classifier. By multiplying all distances with the inverse covariance matrix of the pooled data, an affine-invariant version is obtained. In our procedure we employ an affine-invariant kNN classifier on the DD-plot. It will be shown (Section 5) that, if the underlying distribution of each class is strictly unimodal elliptical with the same radial density, the kNN classifier, operating on the DD-plot, achieves asymptotically the optimal Bayes risk.
Given a function x to be classified, we represent it (according to Section 3) first as y = T(x) ∈ R^d and then as a point z = (D(y | Y_1), D(y | Y_2)) on the DD-plot. According to kNN on the DD-plot, x is classified as follows:

class(x) = argmax_{i ∈ {1,2}} Σ_{z′ ∈ N_k(z)} I(z′ belongs to class i),

where I(A) denotes the indicator function of a set A, and N_k(z) is the neighborhood of z defined by the k closest training observations. We use the L∞-distance on the DD-plot; thus N_k(z) is the smallest rectangle centered at z that contains k training observations. In applications, k has to be chosen, usually by means of cross-validation.
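The kNN rule on the DD-plot with the L∞-distance can be sketched as follows (a simplified illustration with our own helper names, without the affine-invariant rescaling):

```python
import numpy as np

def knn_dd(z, Z, labels, k):
    """Majority vote among the k nearest DD-plot points, where nearness is
    measured in the L-infinity distance, so the neighborhood is the
    smallest square centered at z containing k training points."""
    dist = np.max(np.abs(Z - z), axis=1)     # L-infinity distances to z
    nearest = labels[np.argsort(dist)[:k]]   # labels of the k closest points
    return np.bincount(nearest).argmax()
```

Here `Z` holds the DD-plot coordinates of the training data and `labels` their class indices; cross-validating `k` amounts to repeating this vote over a grid of neighborhood sizes.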
The second classification approach is the DDα-classifier, introduced in Lange et al. (2014a). It uses a projective-invariant method called the α-procedure (Vasil'ev & Lange, 1998), a heuristic classification procedure that iteratively decreases the empirical risk. The DDα-classifier employs a modified version of the α-procedure to construct a nonlinear separator on the DD-plot that classifies the depth-represented data points. The construction is based on depth values and products of depth values (= D-features) up to some degree, which can be either chosen a priori or determined by cross-validation. In a stepwise way, linear discrimination is performed in subspaces of the feature space. In each step a pair of D-features is replaced by a new D-feature, as long as the average misclassification rate decreases and basic D-features are left to be replaced. For details, we refer to Lange et al. (2014a).
We employ three depths (the Mahalanobis, spatial, and projection depths), which are positive on the whole of R^d and thus do not produce outsiders. It is known that the DDα-classifier is asymptotically Bayes-optimal for the location-shift model (see Lange et al., 2014a) and performs well for broad classes of simulated distributions and a wide variety of real data (see Lange et al., 2014b; Mozharovskyi et al., 2015). The main advantage of the DDα-classifier is its high training speed: it relies on the α-procedure, which, on the DD-plot, has quick-sort complexity and proves to be very fast. The separating polynomial is constructed by space extensions of the DD-plot (which is of low dimension) and cross-validation.
5 Some theoretical properties of the classifiers
In this section, we treat optimality properties of classification procedures that operate on a DD-plot of T-transformed functional data. Besides kNN-classification and the α-procedure, they include linear and quadratic discriminant analysis as well as maximum depth classification, all applied to the DD-plotted data. The latter procedures will be included in the comparative study of Section 7. First, in Subsection 5.1, it is shown that well-known properties regarding the Bayes optimality of classifiers for multivariate data carry over to T-transformed functional data if the discretization points are fixed and the dimension d is chosen large enough. Second, in Subsection 5.2, we introduce a sampling scheme that evaluates the functions at an increasing number of points, and establish Bayes optimality in the Gaussian case.
Let F be the space of real functions, defined on the finite interval [0, 1], which are continuous and piecewise smooth. Consider two stochastic processes X and Y on a common probability space, whose paths are in F with probability 1. (For conditions on Gaussian processes that yield smooth paths, see e.g. Cambanis (1973).)
5.1 Fixed-discretization optimality
Let S = {t_1, …, t_k} be a given finite set of discretization points, and for each x ∈ F let x̃ be its linear interpolation based on S, as described in Section 3. Denote the space of these linear interpolations by F̃. Then F̃ is a finite-dimensional linear space, which is isomorphic to R^k. Consider two samples of independent random vectors in F̃, the first distributed as P_X and the second as P_Y, where P_X and P_Y are the marginal distributions of X resp. Y on S, mapped to F̃, and let the two samples be independent of each other.
Proposition 1 (Bayes optimality on )
Consider a class of decision rules on the interpolation space and assume that it contains a sequence whose risk converges in probability to the Bayes risk. Then there exists a pair (p, q) so that the same class of decision rules, operating on the T-transformed data in R^d, contains a sequence whose risk also converges in probability to the Bayes risk.
We say that a sequence of decision rules is Bayes optimal if their risk converges in probability to the Bayes risk. The proof of Proposition 1 is obvious: e.g., choose p and q large enough that the transform T is injective on the interpolation space. Then the Bayes optimality trivially carries over to the T-transformed data in R^d.
The Proposition allows for procedures that achieve error rates close to the minimum. However, note that it does not refer to Bayes optimality of classifying the underlying process, but just of classifying the k-dimensional marginal distributions corresponding to the discretization points. In particular, if the two processes have equal marginal distributions at these points, no discrimination is possible.
Corollary 1 (Optimality of LDA and QDA)
Let X and Y be Gaussian processes, let the priors of class membership be equal, and let proper p and q be selected as above. The following rules operating on T-transformed data are Bayes optimal, relative to the marginals at the discretization points:
quadratic discriminant analysis (QDA),
linear discriminant analysis (LDA), if X and Y have the same covariance function.
For a definition of LDA and QDA see, e.g., Devroye et al. (1996, Ch. 4.4). Proof: As the processes are Gaussian, the marginal distributions at the discretization points are multivariate normal, for which QDA, resp. LDA in the case of equal covariance matrices, is known to yield the Bayes rule.
The following result is taken from Lange et al. (2014a). Let H be a hyperplane in R^d, and let π_H denote the orthogonal projection on H. A probability distribution Q in R^d is called the mirror image of another probability distribution P with regard to H if, for Z distributed as P, it holds that 2π_H(Z) − Z is distributed as Q.
Proposition 2 (Lange et al. (2014a))
Let P and Q be probability distributions in R^d having densities p and q, let H be a hyperplane such that Q is the mirror image of P with respect to H, and let p ≥ q hold in one of the half-spaces generated by H. Then, based on a 50:50 independent sample from P and Q, the α-procedure will asymptotically yield the linear separator that corresponds to the bisecting line of the DD-plot.
Notice that, due to the mirror symmetry of the distributions in R^d, the DD-plot is symmetrically distributed as well. The symmetry axis is the bisector, which is obviously the result of the α-procedure when the sample is large enough. Then the α-rule corresponds to the Bayes rule. In particular, the requirements of the proposition are satisfied if P and Q are mirror symmetric and unimodal.
A stochastic process is called a strictly unimodal elliptical process if its finite-dimensional marginals are elliptical with the same radial density r, that is, with densities of the form f(z) = c |Σ|^(−1/2) r((z − μ)ᵀ Σ⁻¹ (z − μ)), and r is strictly decreasing. Here, μ denotes the shift vector and Σ the structural matrix.
Corollary 2 (Optimality of kNN- and DDα-rules)
Assume that the processes X and Y are strictly unimodal elliptical and have the same radial density, and choose p and q as above. Then
(a) the following rules operating on T-transformed data are Bayes optimal, relative to the marginals at the discretization points:
kNN on the DD-plot with the Mahalanobis, the spatial, and the projection depth.
(b) If, in addition, the structural matrices coincide, Σ_X = Σ_Y, and the priors of class membership are equal, then the following rules operating on T-transformed data are Bayes optimal, relative to the marginals at the discretization points:
the maximum-depth rule,
the DDα-rule with the Mahalanobis depth,
the DDα-rule with the spatial depth,
the DDα-rule with the projection depth.
Proof: Parts (i) to (iii) follow from Proposition 1 above and Theorem 3.5 in Vencálek (2011). Parts (v) to (vii) of Corollary 2 are a consequence of Proposition 1 and Proposition 2. Part (iv) is deduced from Proposition 1 and Ghosh & Chaudhuri (2005), who demonstrate that the maximum-depth rule is asymptotically equivalent to the Bayes rule if the two distributions have the same prior probabilities and are elliptical in R^d with only a location difference.
The above optimality results regard rules that operate on a DD-plot of T-transformed data. For comparison, we will also apply the kNN-rule directly to T-transformed data.
Corollary 3 (Optimality of kNN)
The kNN-classifier applied to T-transformed data (where p and q are selected as above) is Bayes optimal, relative to the marginals at the discretization points.
5.2 Functional optimality
Next let us consider the asymptotic behavior of our procedure regarding the functions themselves. The above results refer to Bayes optimality with respect to the k-dimensional marginal distributions of the two processes that correspond to a fixed set of discretization points. Obviously, they allow no inference on the whole functions.
In order to classify asymptotically with respect to the processes, we introduce a sequential sampling scheme that evaluates the functions at an increasing number of discretization points. For this, a slightly different notation will be needed.
For each n ∈ N consider a set of discretization points S_n = {t_{n,1}, …, t_{n,k_n}}, such that 0 = t_{n,1} < … < t_{n,k_n} = 1 for all n, and the mesh size max_j (t_{n,j+1} − t_{n,j}) tends to 0 as n → ∞.
Assume that x_1, …, x_{m_1} are i.i.d. sampled paths from X, and that each x_i is observed at the times in S_n. Similarly assume that y_1, …, y_{m_2} are i.i.d. from Y, each being observed at the times in S_n, and that the two samples are independent of each other.
In this setting, we apply our classification procedure with p and q depending on n. By cross-validation we select an optimal pair (p_n, q_n) from a set of candidate pairs whose size is restricted by the Vapnik-Chervonenkis bound of Section 6.
For each sampled path, let x̃_i be the linearly interpolated function that is supported at the points of S_n, and define ỹ_j similarly. Let P̃_n denote the empirical distribution on the x̃_i, and Q̃_n that on the ỹ_j. Then, obviously, the classification based on the T-transformation, with (p_n, q_n) being selected via cross-validation, is optimal regarding the empirical distributions P̃_n and Q̃_n.
Now we can extend Corollary 1 to the case of process classification:
Theorem 1 (Bayes optimality with respect to Gaussian processes)
Let X and Y be Gaussian processes that have piecewise smooth paths, and let the priors of class membership be equal. Under the above sequential sampling scheme, and if (p_n, q_n) is selected as above, the following rules operating on the T-transformed data are almost surely Bayes optimal regarding the processes X and Y:
quadratic discriminant analysis (QDA),
linear discriminant analysis (LDA), if and have the same covariance function.
Proof: As the processes are Gaussian, we obtain that
where , and the are -covariance matrices, . Observe that the optimal (i.e. crossvalidated) error rate converges to the probability of error. If is selected from , then, as in Corollary 1, the Bayes risk is achieved with respect to the empirical distributions. Theorem 1(ii) of Nagy & Gijbels & Hlubinka (2015) says that under the above sampling scheme it holds
As for every the Bayes risk is achieved with respect to and , (15) implies that, in the limit (), the Bayes risk is almost surely achieved with respect to the processes and .
6 Choosing the dimensions and
Clearly, the performance of our classifier depends on the choice of and , which still has to be discussed. In what follows we assume that the data are given as functional values at a (usually large) number of discretization points. Let denote the number of these points.
Delaigle et al. (2012) propose to perform the classification in a finite-dimensional space that is based on a subset of discretization points selected to minimize the average error. However, these authors do not offer a construction rule for this task but rely on multi-layer leave-one-out cross-validation, which is very time-consuming. Having recognized this problem, they suggest some time-saving modifications of the cross-validation procedure. Clearly, the computational load is then determined by the cross-validation scheme used. It naturally depends on the size of the data sample, multiplied by the training time of the finite-dimensional classifier, which may itself include parameters to be cross-validated.
The approach of Delaigle et al. (2012) (crossDHB for short) suggests a straightforward procedure for our problem of constructing an -space: allow for a rich initial set of possible pairs and then use cross-validation to select a pair that (on average) performs best. The initial set shall consist of all pairs , say, satisfying . Other upper bounds may be used, e.g. if exceeds the number of observations. In addition, a dimension-reducing technique like PCA or factor analysis may be in order. But this cross-validation approach (crossLS for short), similar to that of Delaigle et al. (2012), is still time-consuming; see Section 7.4 for computing times. The problem of selecting an appropriate pair remains challenging.
In deciding about and , we will consider the observed performance of the classifier as well as some kind of worst-case performance. Fortunately, the conservative error bound of Vapnik and Chervonenkis (Vapnik & Chervonenkis, 1974; see also Devroye et al., 1996) provides guidance. The main idea is to measure how well a certain location-slope space can perform at all, by referring to the worst-case empirical risk of a linear classifier on the training classes. Clearly, the class of linear rules is limited, but it may be regarded as an approximation. This limitation also keeps the deviation from the empirical risk small and allows for its implicit comparison with the empirical risk itself. Moreover, in choosing the dimension of the location-slope space we may adjust for the required complexity, so that the final separation rule is close to a linear one.
Here, linear discriminant analysis (LDA) is used for the linear classification. Although other approaches like perceptron, regression depth, support vector machine, or -procedure can be employed instead, we use LDA as it is fast and linear. Below we derive the Vapnik-Chervonenkis conservative bound for our particular case; this is similar to Theorem 5.1 in Vapnik & Chervonenkis (1974). Here we use another bound on the cardinality of the class of separating rules (see also Devroye et al., 1996).
Consider the -transformed data and , as two samples of random vectors that are distributed as and , respectively. Let be a class of separation rules , where two different rules separate into two different subsets. Then each produces an expected classification error,
with and being prior probabilities of and , and being an indicator function. As mentioned above, can be consistently estimated by leave-one-out cross-validation from and . However, this estimate comes at substantial computational cost. On the other hand, based on the empirical risk , the expectation can be bounded from above. The empirical risk is given by
As discussed before, we constrain to be the class of linear rules. Then the number of possible rules equals the number of different separations of by hyperplanes . Note that is the number of possible different separations of points in that can be achieved by hyperplanes containing the origin. Consequently, it holds , and equality is attained if are in general position. Applying Hoeffding’s (1963) inequality, the probability that the largest (over the class ) excess of the classification error over the empirical risk exceeds some positive value can be bounded from above:
This allows us to formulate the following proposition:
For the class of linear separating rules in , it holds with probability :
Proof: Requiring yields . Solving the latter w.r.t. gives . Then, with probability , equation (19) follows immediately.
The probability is fixed to some small constant. Here we use , given the sample size. The goodness of a certain choice of and is then measured by
We restrict our choice of and to those that make small. An intuitive justification is the following: the first term refers to the empirical risk, while the second term penalizes the dimension; together they balance fit and complexity.
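The shape of such a penalized criterion can be sketched numerically. The exact constants of the paper's bound (19) are not reproduced here; the Python fragment below (all function names are ours) uses the standard Hoeffding-plus-union-bound form together with Cover's count of linear dichotomies, which is only meant to illustrate how the penalty grows with the dimension.

```python
import math

def num_linear_dichotomies(n, d):
    """Cover's count: n points in general position in R^d admit at most
    2 * sum_{i=0}^{d} C(n-1, i) labelings by affine hyperplanes."""
    return 2 * sum(math.comb(n - 1, i) for i in range(d + 1))

def vc_goodness(emp_risk, n, d, eta=0.05):
    """Empirical risk plus a Hoeffding/union-bound penalty: with
    probability >= 1 - eta, the true error of every linear rule in R^d
    exceeds its empirical risk by at most the penalty term.
    (Sketch only; the paper's constants may differ.)"""
    penalty = math.sqrt(
        (math.log(num_linear_dichotomies(n, d)) - math.log(eta)) / (2.0 * n))
    return emp_risk + penalty
```

For fixed sample size, the penalty is increasing in the dimension `d = l + s`, so low-dimensional location-slope spaces with small empirical risk are preferred.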
To find out whether the proposed bound really helps in finding proper dimensions and , we first apply our approach to two simulated data settings of Cuevas et al. (2007), called ‘Model 1’ and ‘Model 2’ (both ). The data-generating process of Cuevas et al. (2007) is described in Section 7.2 below. We determine and , use the Mahalanobis depth to construct a -plot and apply the -classifier, abbreviated in the sequel as -. For each pair with and , the risk bound and the average classification error (ACE) are calculated by averaging over 100 takes and are plotted in Figure 4. Note that these patterns look similar when the spatial or projection depth is used. Further, for the two benchmark data problems, we estimate the ACE by means of leave-one-out cross-validation (see Figure 5). Here all -pairs with and are considered.
As expected, Figures 4 and 5 largely support the statement “the less the smaller the ACE”. Although for the challenging medflies data (Figure 5, right) the plot remains a fuzzy scatter, a substantial part of the points with small error have low values of . (Moreover, the -pair with the smallest actually corresponds to the smallest error; see the point in the lower left corner.) Thus still guides us in configuring the location-slope space, as is affirmed by our experimental results in Section 7.3 below. Computing involves a single calculation of the LDA-classifier (viz. to estimate ), which is done in negligible time. Hence the -pair with the smallest could simply be picked; instead, we restrict attention to the few -pairs that deliver the smallest values of and choose the best one by cross-validating over this tiny range. This new technique is employed here for space building. We abbreviate it as VCcrossLS.
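The selection of candidate pairs just described can be sketched as follows. This Python fragment is schematic (the implementation is in R): `emp_risk` stands for the LDA empirical-risk estimate, and the penalty uses a crude cardinality approximation of order n^d in place of the paper's exact bound.

```python
import math

def select_pairs(emp_risk, n, max_dim=10, eta=0.05, keep=5):
    """Rank all (l, s) pairs with l >= 1, s >= 0, l + s <= max_dim by
    the VC-penalized empirical risk and keep the `keep` smallest; the
    final pair among these would then be chosen by a small
    cross-validation. `emp_risk(l, s)` is a user-supplied estimate."""
    scored = []
    for l in range(1, max_dim + 1):
        for s in range(0, max_dim + 1 - l):
            d = l + s
            # crude |C| ~ n^d cardinality approximation in the
            # union bound; the paper's exact constants differ
            penalty = math.sqrt((d * math.log(n) - math.log(eta)) / (2.0 * n))
            scored.append((emp_risk(l, s) + penalty, (l, s)))
    scored.sort()
    return [pair for _, pair in scored[:keep]]
```

Note that only the few retained pairs enter the final cross-validation, which is what keeps the overall training time small.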
7 Experimental comparisons
To evaluate the performance of the proposed methodology we compare it with several classifiers that operate either on the original data or in a location-slope space. After introducing those competitors (Section 7.1) we present a simulation study (Section 7.2) and a benchmark study (Section 7.3), including a discussion of computation loads (Section 7.4). Implementation details of the experiments are provided in Appendix A.
The new classifiers are compared with four classification rules: linear (LDA) and quadratic (QDA) discriminant analysis, -nearest-neighbors () classification, and maximum-depth classification (employing the three depth notions), all operating in a properly chosen location-slope space constructed with the bounding technique VCcrossLS of Section 6. (One may argue that the -based choice of is not generally suited for , but it delivers comparable results in reasonable time, much faster than cross-validation over all -pairs.) Also, the four classifiers mentioned above, together with the two new ones (with all three depths), are used when the dimension of the location-slope space is determined by unrestricted cross-validation crossLS. For further comparison, all 12 classifiers are applied in the finite-dimensional space constructed according to the crossDHB methodology of Delaigle et al. (2012).
LDA and QDA are calculated with classical moment estimates, and priors are estimated by the class proportions in the training set. We include in our competitors as it is Bayes-risk consistent in the finite-dimensional setting and generally performs very well in applications. The -classifier is applied to the location-slope data in its affine-invariant form. It is then defined as in (9), but with the Mahalanobis distance (determined from the pooled data) in place of the -distance. is selected by cross-validation.
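The affine-invariant nearest-neighbor rule amounts to replacing the Euclidean metric by the Mahalanobis metric of the pooled sample. A Python sketch (names are ours; the paper's implementation is in R):

```python
import numpy as np

def knn_mahalanobis(x, X, y, k=5):
    """k-nearest-neighbours vote with the Mahalanobis distance
    determined from the pooled training data (affine-invariant form)."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - x
    dist = np.einsum('ij,jk,ik->i', d, S_inv, d)  # squared Mahalanobis
    nn = np.argsort(dist)[:k]
    return int(np.argmax(np.bincount(y[nn])))

# Toy data: two well-separated classes in R^2.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_mahalanobis(np.array([0.2, 0.2]), X, y, k=3)
```

Because the metric is computed from the pooled data, the rule is invariant under affine transformations of the location-slope space, which matters since the coordinates (levels vs. slopes) live on different scales.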
As further competitors we consider three maximum depth classifiers. They are defined as
with being either the Mahalanobis depth , the spatial depth , or the projection depth . denotes the prior probability of class . The priors are estimated by the class proportions in the training set. This classifier is Bayes optimal if the data come from an elliptical location-shift model with known priors. For technical and implementation details the reader is referred to Appendix A.
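For concreteness, a maximum-depth rule with the Mahalanobis depth can be sketched in a few lines of Python (a simplified illustration under equal priors; the paper's implementation is in R):

```python
import numpy as np

def mahalanobis_depth(x, sample):
    """Mahalanobis depth D(x|F) = 1 / (1 + (x-mu)' S^{-1} (x-mu)),
    with mu and S the usual moment estimates from the sample."""
    mu = sample.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    d = x - mu
    return 1.0 / (1.0 + d @ S_inv @ d)

def max_depth_classify(x, class0, class1, prior0=0.5, prior1=0.5):
    """Assign x to the class in which it is (prior-weighted) deepest;
    in practice the priors are estimated by class proportions."""
    d0 = prior0 * mahalanobis_depth(x, class0)
    d1 = prior1 * mahalanobis_depth(x, class1)
    return 0 if d0 >= d1 else 1

# Two toy classes in R^2, well separated.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 2))
B = rng.normal(5.0, 1.0, size=(50, 2))
label = max_depth_classify(np.array([0.2, -0.1]), A, B)
```

The spatial and projection depths plug into the same rule; only the depth computation changes.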
7.2 Simulation settings and results
Next we explore our methodology by applying it to the simulation setting of Cuevas et al. (2007). Their data are generated from two models, each having two classes. The first one, Model 1, is:
where is a Gaussian process with and , discretized at 51 equidistant points on (); see Figure 6 (left) for an illustration. The functions are smooth and differ in mean only, which makes the classification task rather simple.
We take 100 observations (50 from each class) for training and 100 (50 from each class) for evaluating the performance of the classifiers. Training and classification are repeated 100 times to get stable results. Figure 7 (left) presents boxplots (over 100 takes) of error rates of twelve classifiers applied after transforming the data to properly constructed finite-dimensional spaces. The top panel refers to a location-slope space, where and are selected by Vapnik-Chervonenkis restricted cross-validation (VCcrossLS), the middle panel to a location-slope space whose dimensions are determined by mere, unrestricted cross-validation (crossLS), and the bottom panel to the finite-dimensional argument subspace constructed by the componentwise method crossDHB of Delaigle et al. (2012). The classifiers are: linear (LDA) and quadratic (QDA) discriminant analysis, the -nearest neighbors classifier (), the maximum depth classifier with Mahalanobis (MD-M), spatial (MD-S) and projection (MD-P) depth, the -plot classifier with rule based on distance (-M, -S, -P), and the -classifier (-M, -S, -P), both with the three depths, correspondingly. The last approach (crossDHB) has not been combined with the projection depth for two reasons. First, performing the necessary cross-validations with the projection depth becomes computationally infeasible; for computation times see Table 14 in Appendix B. Second, the quality of approximation of the projection depth varies between tries, and this instability is possibly misleading when choosing the optimal argument subspace, thus yielding rather high error rates; compare, e.g., the classification errors for the growth data, Table 1 in Section 7.3.
As expected, all -plot-based classifiers, applied in a properly chosen location-slope space, show highly satisfactory performance, which is in line with the best result of Cuevas et al. (2007). The location-slope spaces that have been selected for the different classifiers, viz. the corresponding -pairs, do not differ much; see Table 3 in Appendix B. The picture remains the same when the location-slope space is chosen using unrestricted cross-validation crossLS, which yields no substantial improvement.
On the other hand, classifiers operating in an optimal argument subspace (crossDHB) are outperformed by those employed in the location-slope space (crossLS, VCcrossLS), although their error rates are still low; see Figure 7 (left). A plausible explanation could be that the differences between the classes at each single argument point are not pronounced enough, so that a number of points have to be combined for reasonable discrimination. But the sequential character of the procedure (as discussed in the Introduction and in Appendix A) prevents it from choosing higher dimensions. Most often dimension two or three is chosen; see Table 11 in Appendix B. Our procedure appears to be better, as integrating extracts more information from the functions, so that they become distinguishable.
Next we consider an example where averaging turns out to be rather a disadvantage. Model 2 of Cuevas et al. (2007) reads as follows:
with and as before. is an 8-knot spline approximation of . See Figure 6 (right) for illustration.
The corresponding boxplots of the error rates are depicted in Figure 7 (right). The results differ between individual classifiers. When the location-slope space is chosen using the Vapnik-Chervonenkis bound (VCcrossLS), LDA and all maximum depth classifiers perform poorly, while the -classifiers with Mahalanobis and spatial depths perform better. The -classifiers perform comparably. The last two lack efficiency when employed with the projection depth; as seen from Table 4 in Appendix B, the efficient location-slope spaces are of substantially higher dimension (8 and more). Thus, the larger error rates with the projection depth are explained by the insufficient number of random directions used in approximating this depth. Also, with the projection depth, in contrast to the Mahalanobis and spatial depths, less efficient -pairs are preferred; see Tables 4 and 8 in Appendix B.
Choosing the location-slope space by unrestricted cross-validation (crossLS) does not change much. The error rates obtained with this location-slope space are larger than those obtained with the synthesized space (crossDHB), although with QDA and the -plot-based classifiers they stay reasonable. In Model 2, taking several extreme points would be enough to distinguish the classes, and the finite-dimensional spaces most often have dimension four, and sometimes three. (Note that all -plot-based classifiers regard these dimensions as sufficient and, together with QDA, deliver the best results.) On the other hand, the classifiers operating in some location-slope space select efficient dimension 8 or higher, as is also seen from Tables 4 and 8 in Appendix B.
7.3 Comparisons on benchmark data
Now we come back to the two benchmark problems given in the Introduction. The growth data have already been analyzed by several authors. López-Pintado & Romo (2006) achieve a best classification error of 16.13 % when classifying with the distance from the trimmed mean, where trimming is based on the generalized band depth with trimming parameter . Cuesta-Albertos & Nieto-Reyes (2010) use an average distance weighting with the random Tukey depth and get a classification error of 13.68 %. Cuevas et al. (2007) obtain a mean error rate of 9.04 % when using the double random projection depth and 4.04 % with , dividing the sample into 70 training and 23 testing observations over 70 tries. Cuesta-Albertos et al. (2014) achieve 3.5 % when employing the LDA-classifier on a -plot that is based on the -mode depth applied to both location and slope. Sguera et al. (2014) get an error rate of 3.45 % when classifying with the kernelized functional spatial depth and choosing the kernel parameter by means of cross-validation.
Table 1 (columns ‘growth data’) provides errors, estimated by leave-one-out cross-validation, of the different classification techniques. Here, a proper location-slope space is either based on the Vapnik-Chervonenkis bound (column VCcrossLS) or on unrestricted cross-validation (crossLS), or an optimal argument subset is chosen by the componentwise technique of Delaigle et al. (2012) (crossDHB). Note that with VCcrossLS classification by is best. It achieves an error rate of 3.23 %, which means that only three observations are misclassified. This coincides with the best cross-validation result of Baíllo & Cuevas (2008). Both -plot-based classifiers perform well with all three depths, while the maximum depth classifiers MD-S and MD-P perform worse.
The Vapnik-Chervonenkis restricted cross-validation VCcrossLS seems to perform not much worse than the unrestricted cross-validation crossLS, and it mostly outperforms the componentwise approach crossDHB. The latter is particularly bad when the projection depth is involved. Tables 5, 9 and 13 in Appendix B show how often various -pairs and dimensions of the optimal argument subspace are chosen.
In general, all three space-constructing techniques allow for very low error rates, producing as few as three misclassifications if the classifier is properly chosen. The -classifiers on -spaces yield at most six misclassifications.
Compared to López-Pintado & Romo (2006) and Cuesta-Albertos & Nieto-Reyes (2010), the better performance of our classifiers appears to be due to the inclusion of average slopes. Observe that the acceleration period starting at approximately nine years of age discriminates particularly well between girls and boys; see Figure 2. Note that the componentwise method also prefers (location) values from this interval.
A much more involved real task is the classification of the medflies data. In Müller & Stadtmüller (2005) these data are analyzed by generalized functional linear models. The authors employ logit regression and semi-parametric quasi-likelihood regression. They get errors of 41.76 % and 41.2 %, respectively, also estimated by leave-one-out cross-validation.
We apply all classifiers to the same data in properly constructed location-slope spaces. With our procedure we are able to improve the discrimination between long- and short-lived flies. In particular, with the -classifier based on the projection depth an error of 35.02 % is obtained (see Table 1, columns captioned ‘medflies data’, for the errors). The role of the derivatives in building the location-slope space is emphasized in Tables 6 and 10 in Appendix B, which show the frequencies at which the various -pairs are selected. crossLS is outperformed in most of the cases. We were not able to include the componentwise approach crossDHB in the comparison, as its computational load is too high. On average, the -classifiers perform very satisfactorily. LDA and QDA with the Vapnik-Chervonenkis bound, and the maximum-depth classifiers with Mahalanobis and spatial depth (MD-M, MD-S), also deliver reasonable errors.
Note that, in configuring the location-slope space with crossLS, lower errors could be obtained by using finer (e.g. leave-one-out) cross-validations. To make componentwise classification feasible and the comparison fair, we have used only 10-fold cross-validation in all procedures besides . ( in and -plot-based has been determined by leave-one-out cross-validation.) For exact implementation details the reader is referred to Appendix A.
7.4 Computational loads
Most methods of functional classification tend to be time-consuming because of their need for preprocessing, smoothing and parameter tuning, but the literature on such methods usually does not discuss computation times. Nevertheless, this is an important practical issue. Our procedure turns out to be particularly efficient for three main reasons. First, any cross-validation is restricted to very few iterations. Second, the depth space, where the final classification is done, has low dimension, equal to the number of classes. Third, the linear interpolation requires no preprocessing or smoothing of the data.
To illustrate this we briefly present the durations of both training and classification phases for the two real data sets and the twelve classification techniques in Table 2. (The classification time of a single object is reported in parentheses below.) As the times depend on the implementation and platform used, we also report (in square brackets) the numbers of cross-validations done, as this measure is independent of the eventual classification technique. The training times have been obtained as the average over all runs of the leave-one-out cross-validation (thus using 92, respectively 533, observations for the growth and medflies data). This comes very close to the time needed to train with the entire data set, as the difference of one observation is negligible. The classification times in Table 2 have been obtained in the same way, i.e. by averaging the classification of each single observation over all runs of the leave-one-out cross-validation. The same holds for the number of cross-validation iterations. For the componentwise classifiers (crossDHB) the averages have been replaced by medians for the following reason. The sequential character of the procedure causes an exponential increase of time with each iteration (in some range; see Appendix A for implementation details). Therefore, occasionally the total computation time can be an outlier. (In our study, once the time exceeded four hours, viz. when classifying the growth data with the -plot-based -classifier and projection depth, which required 20450 iterations to cross-validate.) On the other hand, when employing faster classifiers (which usually require stronger assumptions on the data) the training phase can take less than two minutes. (This has been pointed out by Delaigle et al. (2012) as well.)
For the growth data, training times are substantially higher when choosing an -pair by unrestricted cross-validation than when the Vapnik-Chervonenkis bound is employed. However, for the fast maximum depth classifier (with Mahalanobis or spatial depth) computation times appear quite comparable. Application of crossDHB causes an enormous increase in time (as well as in the number of cross-validations needed). For the medflies data, as expected, VCcrossLS is faster than crossLS. We were not able to perform the leave-one-out cross-validation estimation for the componentwise method on this data set because of its excessive computational load.
See also Table 14 for the same experiment regarding simulated data. Here, the projection-depth-based classifiers have not been implemented at all, as they need too much computation time.
An efficient nonparametric procedure has been introduced for binary classification of functional data. The procedure consists of a two-step transformation of the original data plus a classifier operating on the unit square. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the -plot, which is a subset of the unit square. Three alternative depth functions are employed for this, as well as two rules for the final classification on .
Our procedure either outperforms or matches existing approaches on simulated as well as on real benchmark data sets. The results of the -plot-based and the -procedure are generally good, although (cf. Model 2) they are slightly outperformed by the componentwise classification method of Delaigle et al. (2012).
As the raw data are linearly interpolated, no spurious information is added. The core of our procedure is the new data-dependent construction of the location-slope space. We bound its dimension by a Vapnik-Chervonenkis bound. The subsequent depth transformation into the unit square makes the procedure rather robust since the final classification is done on a low-dimensional compact set.
Our use of statistical data depth functions demonstrates the variety of their applications and opens new prospects when considering the proposed location-slope space. To reflect the dynamic structure of functional data, the construction of this space, in a natural way, takes levels together with derivatives into account. As has been shown, efficient information extraction is achieved via piecewise averaging of the raw functional data, while the variation of the functions with their argument is reflected by their average slopes.
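The location-slope construction just summarized can be sketched compactly. The following Python fragment (schematic; the function name is ours and the paper's exact averaging details may differ) maps a discretized function to average levels over l equal subintervals and average slopes over s equal subintervals:

```python
import numpy as np

def location_slope(t, y, l, s):
    """Map a function, observed as values y at times t and linearly
    interpolated, to R^{l+s}: average levels over l equal subintervals
    and average slopes over s equal subintervals of the domain. The
    average slope over [a, b] is taken as the chord slope
    (f(b) - f(a)) / (b - a), i.e. the mean derivative."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    f = lambda x: np.interp(x, t, y)
    feats = []
    edges = np.linspace(t[0], t[-1], l + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        grid = np.linspace(a, b, 101)   # trapezoidal average level
        vals = f(grid)
        feats.append(((vals[0] + vals[-1]) / 2 + vals[1:-1].sum())
                     / (len(grid) - 1))
    edges = np.linspace(t[0], t[-1], s + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        feats.append((f(b) - f(a)) / (b - a))  # average slope
    return np.array(feats)

# Identity function on [0, 1], l = 2 levels and s = 2 slopes.
z = location_slope(np.linspace(0, 1, 11), np.linspace(0, 1, 11), 2, 2)
```

Setting s = 0 yields a pure location space, which corresponds to the piecewise continuous case mentioned below.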
The finite-dimensional space has to be constructed in a way that respects the important intervals and captures most of the information. Here, equally spaced intervals are used that cover the entire domain but may differ in number between location and slope. This gives sufficient freedom in configuring the location-slope space. Note that, in view of the simulations as well as the benchmark results, the choice of a particular depth function is of limited relevance only. Depending on the given data, different intervals are differently relevant, and location and slope may differ in information content as well. The set of reasonable location-slope spaces is enormously reduced by application of the Vapnik-Chervonenkis bound, and the selection is done by fast cross-validation over a very small set. The obtained finite-dimensional space can be augmented by coordinates reflecting additional information on the data that may be available. While our procedure is designed for the analysis of functions that are piecewise smooth, it can also be used for functions that are just piecewise continuous (by simply fixing the parameter at 0). Obviously, higher-order derivatives can be involved, too. But obtaining those requires smooth extrapolation, which demands additional computation and possibly introduces spurious information.
The new approach is presented here for classes, but it is not limited to this case. If , is applied in the -dimensional depth space without changes, and the -classifier is extended by constructing either one-against-all or pairwise classifiers in the depth space and finally performing some aggregation in the classification phase (see also Lange et al., 2014a).
Our space selection technique (incomplete cross-validation, restricted by a Vapnik-Chervonenkis bound) is compared, both in terms of error rates and computational time, with a full cross-validation as well as with the componentwise space synthesis method of Delaigle et al. (2012). We do this for all variants of the classifiers.
In future research, comparisons with other existing functional classification techniques as well as the use of other finite-dimensional classifiers on the -plot are needed. Refined data-dependent procedures, which size the relevant intervals and leave out irrelevant ones, may be developed to configure the location-slope space. However, such refinements will possibly conflict with the efficiency and generality of the present approach.
We thank Dominik Liebl for his critical comments on an earlier version of the manuscript, as well as Ondrej Vencalek and Aurore Delaigle for their helpful remarks. The reading and suggestions of two referees are also gratefully acknowledged.
Appendix A Implementation details
In calculating the depths, and for the Mahalanobis depth have been determined by the usual moment estimates, and similarly for the spatial depth. The projection depth has been approximated by drawing 1 000 directions from the uniform distribution on the unit sphere. Clearly, the number of directions needed for a satisfactory approximation depends on the dimension of the space. Observe that for higher-dimensional problems 1 000 directions are not enough, which becomes apparent from the analysis of Model 2 in Section 7.2. There the chosen location-slope spaces have dimension eight or higher; see also Tables 4 and 8 in Appendix B. On the other hand, approximating the projection depth is costly even in low dimensions: computing it from 1 000 directions takes substantially more time than computing the exact Mahalanobis or spatial depths (see Tables 2 and 14 in Appendix B).
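The random-direction approximation can be sketched as follows in Python (a simplified illustration; the actual implementation is the C++-backed R package, and the depth definition via median/MAD outlyingness is the standard one):

```python
import numpy as np

def projection_depth(x, sample, n_dir=1000, seed=1):
    """Approximate the projection depth 1 / (1 + O(x|F)), where the
    outlyingness O is the supremum of |u'x - med(u'X)| / MAD(u'X),
    maximized here over random unit directions u."""
    rng = np.random.default_rng(seed)
    p = sample.shape[1]
    U = rng.normal(size=(n_dir, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj = sample @ U.T                            # (n, n_dir)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    out = np.max(np.abs(x @ U.T - med) / mad)
    return 1.0 / (1.0 + out)

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 2))
d_center = projection_depth(np.zeros(2), S)
d_outlier = projection_depth(np.array([3.0, 3.0]), S)
```

The supremum is only approximated from below, and the approximation quality degrades with the dimension, which is exactly the instability discussed above for Model 2.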
LDA and QDA are used with classical moment estimates, and priors are estimated by the class portions in the training set. The -classifier is applied to location-slope data in its affine invariant form, based on the covariance matrix of the pooled classes. For time reasons, its parameter is determined by leave-one-out cross-validation over a reduced range, viz. . The -procedure separating the -plot uses polynomial space extensions with maximum degree three; the latter is selected by cross-validation. To keep the training speed of the depth-based -classifier comparable with that of the -classifier, we also determine by leave-one-out cross-validation on a reduced range of .
Due to the linear interpolation, the levels are integrated as piecewise-linear functions and the derivatives as piecewise constant ones. If the dimension of the location-slope space is too large (in particular for inverting the covariance matrix, as can be the case in Model 2), PCA is used to reduce the dimension. Then is estimated, and all further computations are performed in the subspace of principal components having positive loadings.
To construct the location-slope space, first all pairs satisfying are considered. ( amounts to 26 for the synthetic and to 16 for the real data sets.) For each , the data are transformed to , and the Vapnik-Chervonenkis bound is calculated. Then the five pairs having the smallest are selected. Tied values of are taken into account as well, with the consequence that on average slightly more than five pairs are selected; see the growth data in Table 2 and both synthetic models in Table 14 of Appendix B. Finally, among these the best -pair is chosen by means of cross-validation. Note that the goal of this cross-validation is not so much to choose a best location-slope dimension as to discard obviously misleading -pairs, which may still yield relatively small values of ; this is seen from Figures 4 and 5. When determining an optimal -pair by crossLS, the same set of -pairs is considered as with VCcrossLS.
In implementing the componentwise method of finite-dimensional space synthesis (crossDHB) we have followed Delaigle et al. (2012) with slight modifications. Their original approach is combined with the sequential approach of Ferraty et al. (2010). Initially, a grid of equally () distanced discretization points is built. Then a sequence of finite-dimensional spaces is synthesized by adding points of the grid step by step. We start with all pairs of discretization points that are at least a distance apart. (Note that Delaigle et al. (2012) start with single points instead of pairs.) The best of them is chosen by cross-validation. Then features are added step by step: in each step, the point that has the best discrimination power (again in the sense of cross-validation) when added to the already constructed set is chosen as a new feature. The resulting set of points is used to construct a neighborhood of combinations to be considered further. As a neighborhood we use twenty -distanced points in the second step and ten in the third; from the fourth step on, only the sequential approach is applied.
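The greedy core of this forward selection can be sketched as follows. The Python fragment is schematic (names are ours): `cv_error` stands for the cross-validated error of a classifier built on the selected discretization points, and the neighborhood search around the grid is omitted.

```python
from itertools import combinations

def greedy_point_selection(points, cv_error, max_features=6):
    """Greedy forward selection in the spirit of the componentwise
    method: start from the best pair of points, then repeatedly add
    the point that most reduces the cross-validated error estimate
    `cv_error` (user-supplied); stop when no addition improves it."""
    best_pair = min(combinations(points, 2), key=lambda p: cv_error(list(p)))
    selected = list(best_pair)
    while len(selected) < max_features:
        remaining = [p for p in points if p not in selected]
        if not remaining:
            break
        best = min(remaining, key=lambda p: cv_error(selected + [p]))
        if cv_error(selected + [best]) >= cv_error(selected):
            break  # no improvement: stop adding features
        selected.append(best)
    return selected

# Toy criterion: points 1 and 3 carry the class signal; each extra
# point adds a small cost (purely illustrative).
err = lambda pts: 0.5 - 0.2 * len(set(pts) & {1, 3}) + 0.01 * len(pts)
chosen = greedy_point_selection(list(range(5)), err)
```

The sequential character visible here is also what makes the runtime grow so quickly: each added feature triggers a fresh round of cross-validations over all remaining points.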
All our cross-validations are ten-fold, except the leave-one-out cross-validations used to determine with both -classifiers. Of course, partitioning the sample into only ten parts may disadvantage our approach against a more comprehensive leave-one-out cross-validation. We have chosen it to keep the computation times of the crossDHB approach (Delaigle et al., 2012) within practical limits and also to make the comparison of approaches equitable throughout our study.
The calculations have been implemented in an R environment, based on the R-package “ddalpha” (Mozharovskyi et al., 2015), with speed-critical parts written in C++. The R code implementing our methodology, as well as that performing the experiments, can be obtained upon request from the authors. In all experiments, a single core of an Intel Core i7-2600 processor (3.4 GHz) with sufficient physical memory has been used. Thus, regarding the methodology of Delaigle et al. (2012), our implementation differs from their original one and, due to its module-based structure, may result in larger computation times. For this reason we provide the number of cross-validations performed; see Tables 2 and 14 of Appendix B. The comparison appears to be fair, as we always use ten-fold cross-validation together with an identical set of classification rules in the finite-dimensional spaces.
Appendix B Additional tables