Fast DD-classification of functional data
Abstract
A fast nonparametric procedure for classifying functional data is introduced. It consists of a two-step transformation of the original data plus a classifier operating on a low-dimensional space. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the DD-plot, which is a subset of the unit square. This transformation yields a new notion of depth for functional data. Three alternative depth functions are employed for this, as well as two rules for the final classification on the DD-plot. The resulting classifier has to be cross-validated over a small range of parameters only, which is restricted by a Vapnik-Chervonenkis bound. The entire methodology does not involve smoothing techniques, is completely nonparametric, and achieves Bayes optimality under standard distributional settings. It is robust, efficiently computable, and has been implemented in an R environment. Applicability of the new approach is demonstrated by simulations as well as by a benchmark study.
Keywords: Functional depth; Supervised learning; Central regions; Location-slope depth; DD-plot; Alpha-procedure; Berkeley growth data; Medflies data.
1 Introduction
The problem of classifying objects that are represented by functional data arises in many fields of application like biology, biomechanics, medicine and economics. Ramsay & Silverman (2005) and Ferraty & Vieu (2006) contain broad overviews of functional data analysis and the evolving field of classification. At the very beginning of the 21st century many classification approaches were extended from multivariate to functional data: linear discriminant analysis (James & Hastie, 2001), kernel-based classification (Ferraty & Vieu, 2003), the nearest-neighbours classifier (Biau et al., 2005), logistic regression (Leng & Müller, 2006), neural networks (Ferré & Villa, 2006), and support vector machines (Rossi & Villa, 2006). Transformation of functional data into a finite setting is done by using principal and independent (Huang & Zheng, 2006) component analysis, principal coordinates (Hall et al., 2001), wavelets (Wang et al., 2007), functions of very simple and interpretable structure (Tian & James, 2013), or some optimal subset of initially given evaluations (Ferraty et al., 2010; Delaigle et al., 2012).
Generally, functional data are projected onto a finite-dimensional space in two ways: by either fitting some finite basis or using functional values at a set of discretization points. The first approach accounts for the entire functional support, and the basis components can often be well interpreted. However, the chosen basis is not per se best for classification purposes; e.g., Principal Component Analysis (PCA) maximizes dispersion but does not minimize classification error. Moreover, higher-order properties of the functions, which are regularly not incorporated, may carry information that is important for classification (see Delaigle & Hall, 2012, for discussion). The second approach appears to be natural, as the finite-dimensional space is directly constructed from the observed values. But any selection of discretization points restricts the range of the values regarded, so that some classification-relevant information may be lost. Also, the data may be given at arguments of the functions that are neither the same nor equidistant nor frequent enough. Then some interpolation is needed, and the interpolated data instead of the original ones are analyzed. Another issue is the way the space is synthesized. If it is chosen by a heuristic (as in Ferraty et al., 2010), a well-classifying configuration of discretization points may be missed. (To see this, consider three discretization points which jointly discriminate well in the space constructed from their evaluations, but which cannot be chosen subsequently because each of them has a relatively small discrimination power compared to some other available discretization points.) To cope with this problem, Delaigle et al. (2012) consider (almost) all sets of discretization points that have a given cardinality; but this procedure involves an enormous computational burden, which restricts its practical application to rather small data sets.
López-Pintado & Romo (2006), Cuevas et al. (2007), Cuesta-Albertos & Nieto-Reyes (2010) and Sguera et al. (2014) have introduced nonparametric notions of data depth for functional data classification (see also Cuesta-Albertos et al., 2014). A data depth measures how close a given object is to an (implicitly given) center of a class of objects; that is, if objects are functions, how central a given function is in an empirical distribution of functions.
Specifically, the band depth (López-Pintado & Romo, 2006) of a function in a class of functions indicates the relative frequency of its lying in a band shaped by any $j$ functions of the class, where $j$ is fixed. Cuevas et al. (2007) examine five functional depths for tasks of robust estimation and supervised classification: the integrated depth of Fraiman & Muniz (2001), which averages univariate depth values over the function's domain; the mode depth, employing a kernel; the random projection depth, taking the average of univariate depths in random directions; and the double random projection depths, which include first derivatives and are based on bivariate depths in random directions. Cuesta-Albertos & Nieto-Reyes (2010) classify the Berkeley growth data (Tuddenham & Snyder, 1954) by use of the random Tukey depth (Cuesta-Albertos & Nieto-Reyes, 2008). Sguera et al. (2014) introduce a functional spatial depth and a kernelized version of it. Cuesta-Albertos et al. (2014) extensively study applications of the DD-plot to functional classification, employing a variety of functional depths.
There are several problems connected with the depths mentioned above. First, with the exception of the double random projection depth, the functions are treated as multivariate data of infinite dimension. By this, the development of the functions, say in time, is not exploited; these depth notions are invariant with respect to an arbitrary rearrangement of the function values. Second, several of these notions of functional depth break down in standard distributional settings, i.e. the depth functions vanish almost everywhere (see Chakraborty & Chaudhuri, 2014; Kuelbs & Zinn, 2013). Further, the depth takes empirical zero values if the function's hypergraph has no intersection with the hypograph of any of the sample functions or vice versa, which is the case for both half-graph and band depths and their modified versions, as well as for the integrated depth. If a function has zero depth with respect to each class it is called an outsider, because it cannot be classified immediately and requires an additional treatment (see Lange et al., 2014a; Mozharovskyi et al., 2015).
1.1 Two benchmark problems
Naturally, the performance of a classifier and its relative advantage over alternative classifiers depend on the actual problem to be solved. Therefore we start with two benchmark data settings, one fairly simple and the other one rather involved.
First, a popular benchmark set is the growth data of the Berkeley Growth Study (Tuddenham & Snyder, 1954). It comprises the heights of 54 girls and 39 boys measured at 31 non-equally distant time points; see Figure 1, left. This data set was extensively studied in Ramsay & Silverman (2005). Having become a well-behaved classic, it has been considered as a benchmark in many works on supervised and unsupervised classification, by some authors also in a data depth context. Based on the generalized band depth, López-Pintado & Romo (2006) introduce a trimming and weighting of the data and classify them by their (trimmed) weighted average distance to the observations or their distance to the trimmed mean. Cuevas et al. (2007) classify the growth data by their maximum depth, using the five above-mentioned depths and $k$NN as a reference (see also Cuesta-Albertos & Nieto-Reyes, 2010). Recently, Sguera et al. (2014) apply functional spatial depth to this data set, and Cuesta-Albertos et al. (2014) perform DD-plot classification.
Second, we consider a much more difficult task: classifying the medflies data of Carey et al. (1998), where 1000 30-day egg-laying patterns (starting from the fifth day) of Mediterranean fruit fly females are observed. The classification task is to explain longevity by productivity. For this, a subset of 534 flies living at least 34 days is separated into two classes: 278 living 10 days or longer after the 34th day (long-lived), and 256 that died within these 10 days (short-lived), which are to be distinguished using the daily egg-laying sequences (see Figure 1, right, with the linearly interpolated evaluations). This task is taken from Müller & Stadtmüller (2005), who demonstrate that the problem is quite challenging and cannot be satisfactorily solved by means of functional linear models.
1.2 The new approach
We shall introduce a new methodology for supervised functional classification that addresses the mentioned issues, and validate it with the considered real data sets as well as with simulated data. Our approach involves a new notion of functional data depth. It is completely nonparametric, designed to work with raw and irregularly sampled functional data, and does not involve heavy computations. It extends the idea of the location-slope depth (see Mosler & Polyakova, 2012), and employs the depth-to-depth classification technique proposed by Li et al. (2012), the DD$\alpha$-classification of Lange et al. (2014a) and the depth-based $k$NN, first suggested by Vencálek (2011).
Clearly, as any statistical methodology for functional data, our classification procedure has to map the relevant features of the data to some finite-dimensional setting. For this we map the functional data to a finite-dimensional location-slope space, where each function is represented by a vector consisting of integrals of its levels ('location') and first derivatives ('slope') over $J$ resp. $K$ equally sized subintervals.
We interpolate the functions linearly; hence their levels are integrated as piecewise-linear functions, and the derivatives as piecewise-constant ones. Then we classify the data within the location-slope space using a proper depth-based technique. We restrict $J + K$ by a Vapnik-Chervonenkis bound and determine $J$ and $K$ by cross-validation of the final classification error, which is very quickly done. Thus, the loss of information caused by the finite-dimensional representation of the data, which cannot be avoided, is measured by the final error of our classification procedure and minimized. This contrasts with standard approaches using PCA projections or projections on spline bases, which penalize the loss of information with respect to a functional subspace but not with respect to the ability of discrimination.
To construct the classification rule, the training classes in the location-slope space are transformed to a depth-vs-depth plot (DD-plot, see Li et al., 2012), which is a two-dimensional set of points, each indicating the depth of a data point with respect to the two training classes. Finally, the classes are separated on the DD-plot by either applying a $k$-nearest-neighbor ($k$NN) rule (Vencálek, 2011) or the $\alpha$-procedure (Lange et al., 2014a).
The resulting functional data depth, called integral location-slope depth, has a number of advantages. It is based on a linear mapping, which does not introduce any spurious information and can preserve Bayes optimality under standard distributional assumptions. Already in the first step, the construction of the finite-dimensional space aims at the minimization of an overall goal function, viz. the misclassification rate. The simplicity of the mapping allows for its efficient computation.
The procedure has been implemented in the R-package ddalpha.
1.3 Overview
The rest of the paper is organized as follows: Section 2 reviews the use of data depth techniques in classifying finite-dimensional objects. Section 3 presents the new two-step representation of functional data, first in a finite-dimensional Euclidean space (the location-slope space), and then in a depth-vs-depth plot (DD-plot). In Section 4 we briefly introduce two alternative classifiers that operate on the DD-plot, a $k$-nearest-neighbor procedure and the $\alpha$-procedure. Section 5 treats some theoretical properties of the developed classifiers. Section 6 provides an approach to bound and select the dimension of the location-slope space. In Section 7 our procedure is applied to simulated data and compared with other classifiers. Also, the above two real data problems are solved and the computational efficiency of our approach is discussed. Section 8 concludes. Implementation details and additional experimental results are collected in the Appendix.
2 Depth-based approaches to classification in $\mathbb{R}^d$
For data in Euclidean space $\mathbb{R}^d$ many special depth notions have been proposed in the literature; see, e.g., Zuo & Serfling (2000) for definitions and properties. Here we mention three depths: the Mahalanobis, spatial and projection depths. These depths are everywhere positive. Hence they do not produce outsiders, that is, points having zero depth in both training classes. This no-outsider property appears to be essential in obtaining functional depths for classification.
For a point $x \in \mathbb{R}^d$ and a random vector $X$ having an empirical distribution on $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$, the Mahalanobis depth (Mahalanobis, 1936) of $x$ w.r.t. $X$ is defined as
$$D_{Mah}(x|X) = \left(1 + (x - \mu_X)^\top \Sigma_X^{-1} (x - \mu_X)\right)^{-1}, \tag{1}$$
where $\mu_X$ measures the location of $X$, and $\Sigma_X$ the scatter.
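The empirical version of (1) is straightforward to compute. A minimal sketch in Python (our own illustration, not the ddalpha implementation; the sample mean and covariance are used as the moment estimates):

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Empirical Mahalanobis depth of point x w.r.t. sample X (rows = observations),
    with the sample mean as location and the sample covariance as scatter."""
    mu = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = float((x - mu) @ Sigma_inv @ (x - mu))   # squared Mahalanobis distance
    return 1.0 / (1.0 + d2)
```

The depth equals 1 at the sample mean and decreases monotonically along every ray emanating from it.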
The affine-invariant spatial depth (Vardi & Zhang, 2000; Serfling, 2002) of $x$ regarding $X$ is defined as
$$D_{spt}(x|X) = 1 - \left\| \mathbb{E}\left[ v\left( \Sigma_X^{-1/2} (x - X) \right) \right] \right\|, \tag{2}$$
where $\|\cdot\|$ denotes the Euclidean norm, $v(y) = y / \|y\|$ for $y \neq 0$ and $v(0) = 0$, and $\Sigma_X$ is the covariance matrix of $X$. $D_{Mah}$ and $D_{spt}$ can be computed efficiently.
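A direct empirical version of (2) can be sketched as follows (again our own illustration; the inverse square root of the covariance is obtained via an eigendecomposition):

```python
import numpy as np

def spatial_depth(x, X):
    """Empirical affine-invariant spatial depth: 1 minus the norm of the mean
    of the unit vectors v(Sigma^{-1/2}(x - x_i)) over the sample points x_i."""
    Sigma = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Sigma)                  # symmetric inverse square root
    S = V @ np.diag(w ** -0.5) @ V.T
    diffs = (x - X) @ S                           # whitened differences
    norms = np.linalg.norm(diffs, axis=1)
    units = np.divide(diffs, norms[:, None], out=np.zeros_like(diffs),
                      where=norms[:, None] > 0)   # v(0) = 0 convention
    return 1.0 - float(np.linalg.norm(units.mean(axis=0)))
```

For a sample that is symmetric about a point, the unit vectors cancel there and the depth attains its maximum value 1.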
The projection depth (Zuo & Serfling, 2000) of $x$ regarding $X$ is given by
$$D_{prj}(x|X) = \big(1 + O(x|X)\big)^{-1} \tag{3}$$
with
$$O(x|X) = \sup_{\|u\| = 1} \frac{\big|\langle u, x \rangle - m(\langle u, X \rangle)\big|}{\mathrm{MAD}(\langle u, X \rangle)}, \tag{4}$$
where $m$ denotes the univariate median and $\mathrm{MAD}$ the median absolute deviation from the median. Exact computation of $D_{prj}$ is, in principle, possible (Liu & Zuo, 2014) but practically infeasible when $d$ is large. Obviously, $D_{prj}$ is approximated from above by calculating the minimum of univariate projection depths in random directions. However, as the outlyingness $O(\cdot|X)$ is piecewise linear in the direction (and, hence, attains its maximum on the edges of the direction cones of constant linearity), a randomly chosen direction yields the exact depth value with probability zero. For a sufficient approximation one needs a huge number of directions, each of which involves the calculation of the median and MAD of a univariate distribution.
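The random-direction approximation just described can be sketched as follows (our illustration; since the supremum in (4) is approximated from below, the returned value overestimates the true depth):

```python
import numpy as np

def projection_depth(x, X, ndir=1000, seed=0):
    """Approximate projection depth of x w.r.t. sample X by maximizing the
    outlyingness (4) over ndir random directions on the unit sphere."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((ndir, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = X @ U.T                               # sample projections, (n, ndir)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)  # MAD of each projection
    out = np.max(np.abs(x @ U.T - med) / mad)    # max outlyingness over directions
    return 1.0 / (1.0 + out)
```

Every direction requires a median and a MAD of the projected sample, which is why many directions make the approximation expensive.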
To classify objects in Euclidean space $\mathbb{R}^d$, the existing literature employs depth functions in principally two ways:

1. Classify the original data by their maximum depth in the training classes.

2. Transform the data by their depths into a low-dimensional depth space, and classify them within this space.
Ad 1: Ghosh & Chaudhuri (2005) propose the maximum-depth classifier, which assigns an object to the class in which it has maximum depth. In its naive form this classifier yields a linear separation. Maximum-depth classifiers have a plug-in structure; their scale parameters need to be tuned (usually by some kind of cross-validation) over the whole learning process. For this, Ghosh & Chaudhuri (2005) combine the naive maximum-depth classifier with an estimate of the density. A similar approach is pursued with projection depth in Dutta & Ghosh (2012a) and $L_p$-depth in Dutta & Ghosh (2012b), yielding competitive classifiers.
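In its naive form the maximum-depth rule is a one-liner once a depth function is available. A sketch with the Mahalanobis depth (our illustration; the estimator choices are ours):

```python
import numpy as np

def mahal_depth(x, X):
    # empirical Mahalanobis depth, sample moments as estimates
    mu = X.mean(axis=0)
    d2 = float((x - mu) @ np.linalg.inv(np.cov(X, rowvar=False)) @ (x - mu))
    return 1.0 / (1.0 + d2)

def max_depth_classify(z, X0, X1):
    """Assign z to the training class in which it is deeper (ties -> class 0)."""
    return 0 if mahal_depth(z, X0) >= mahal_depth(z, X1) else 1
```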
Ad 2: Depth notions are also used to reduce the dimension of the data. Li et al. (2012) employ the DD-plot, which represents all objects by their depth in the two training classes, that is, by points in the unit square. (The same can be done for $q$ training classes in the $q$-dimensional unit cube.) To solve the classification task, some separation rule has to be constructed in the unit square. Li et al. (2012) minimize the empirical risk, that is, the average classification error on the training classes, by smoothing it with a logistic sigmoid function and, by this, obtain a polynomial separating rule. They show that their approach (with Mahalanobis, projection and other depths) asymptotically achieves the optimal Bayes risk if the training classes are strictly unimodally elliptically distributed. However, in practice the choice of the smoothing constant and the nonconvex optimization, potentially with many local minima, encumber its application. In Lange et al. (2014a) these problems are addressed via the $\alpha$-procedure, which is very fast and speeds up the learning phase enormously. The use of other procedures on the DD-plot is discussed in Cuesta-Albertos et al. (2014).
3 A new depth transform for functional data
Let $\mathcal{F}$ be the space of real functions, defined on a compact interval, which are continuous and smooth at all points besides a finite set, and let $\mathcal{F}$ be endowed with the supremum norm. The data may be given either as observations of complete functions in $\mathcal{F}$ or as functional values at some discretization points, in general neither equidistant nor common ones. If the functional data are given in discretized form, they are usually interpolated by splines of some order (see Ramsay & Silverman, 2005), so that sufficiently smooth functions are obtained. Here we use linear interpolation (that is, splines of order 1) for the following reasons. Firstly, with linear interpolation, neither the functions nor their derivatives need to be smoothed. Thus almost any raw data can be handled easily; e.g., the medflies data, as to the egg-laying process, are naturally discrete, see Figure 1 (right). Secondly, higher-order splines increase the computational load, especially when the number of knots or the smoothing parameter are to be determined as part of the task. Thirdly, splines of higher order may introduce spurious information.
We construct a depth transform as follows. In a first step, the relevant features of the functional data are extracted from $\mathcal{F}$ into a finite-dimensional Euclidean space $\mathbb{R}^{J+K}$, which we call the location-slope space (Section 3.1). Then a $(J+K)$-dimensional depth is applied to the transformed data, yielding a DD-plot in the unit square (Section 3.2), which represents the two training classes. Finally, the separation of the training classes as well as the classification of new data is done on the DD-plot (Section 4).
3.1 The location-slope transform
We consider two classes of functions in $\mathcal{F}$, $\{f_1, \dots, f_m\}$ and $\{g_1, \dots, g_n\}$, each function $f$ being given as measurements at ordered points $t_1 < t_2 < \dots < t_T$ (possibly different for each function),
$$\big(t_i, f(t_i)\big), \quad i = 1, \dots, T. \tag{5}$$
Assume w.l.o.g. $[t_1, t_T] \subseteq [0, 1]$. From the measurements (5) a function $\tilde{f}$ on $[0, 1]$ is obtained as follows. Connect the points $(t_i, f(t_i))$ with line segments and set $\tilde{f}(t) = f(t_1)$ when $t < t_1$ and $\tilde{f}(t) = f(t_T)$ when $t > t_T$. By this, the data become piecewise-linear functions on $[0, 1]$, and their first derivatives become piecewise-constant ones; see Figure 2, left.
A finite-dimensional representation of the data is then constructed by integrating the interpolated functions over $J$ subintervals (location) and their derivatives over $K$ subintervals (slope); see Figure 2 (left). Thus, our location-slope space has dimension $J + K$. It delivers the following transform,
$$u_j = \frac{1}{|I_j|} \int_{I_j} \tilde{f}(t)\, dt, \quad j = 1, \dots, J, \qquad v_k = \frac{1}{|S_k|} \int_{S_k} \tilde{f}'(t)\, dt, \quad k = 1, \dots, K, \tag{6}$$
where $I_1, \dots, I_J$ and $S_1, \dots, S_K$ denote the equally sized subintervals for location and slope. That is, the average values and slopes constitute a point in $\mathbb{R}^{J+K}$. Either $J$ or $K$ must be positive, $J + K \geq 1$. In case $J = 0$ or $K = 0$ the formula (6) is properly modified. Put together we obtain a composite transform $T$,
$$T : \mathcal{F} \to \mathbb{R}^{J+K}, \quad f \mapsto (u_1, \dots, u_J, v_1, \dots, v_K), \tag{7}$$
which we call the location-slope (LS) transform. Note that Cuevas et al. (2007) and Cuesta-Albertos et al. (2014) also include derivatives in their functional depths, but in a different way.
For example, choose $J = 0$ and $K = 2$ for the growth data. Then they are mapped into the location-slope space $\mathbb{R}^2$, which is shown in Figure 2 (right). Here the functions' first derivatives are averaged on two half-intervals. That is, for each function two integrals of the slope are obtained: the integral over $[0, \frac{1}{2}]$ and that over $[\frac{1}{2}, 1]$. Here, the location is not incorporated at all. Figure 2, left, exhibits the value (height) and first derivative (velocity) of a single interpolated function, which is then represented by the average slopes on the two half-intervals, yielding the rectangular point in Figure 2 (right).
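The transform for a single discretized function uses only elementary operations. A sketch (our illustration, not the ddalpha implementation; as an assumption on our part, the equal-length subintervals span the observed range of arguments):

```python
import numpy as np

def ls_transform(t, y, J, K):
    """Map a function, given by values y at ordered points t, to R^{J+K}:
    J average levels and K average slopes of its linear interpolant over
    equally sized subintervals."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    feats = []
    # location: mean of the piecewise-linear interpolant over each subinterval
    edges = np.linspace(t[0], t[-1], J + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        grid = np.union1d(t[(t > a) & (t < b)], [a, b])
        vals = np.interp(grid, t, y)
        area = np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid))  # trapezoid rule
        feats.append(area / (b - a))
    # slope: the mean derivative over [a, b] is simply (f(b) - f(a)) / (b - a)
    edges = np.linspace(t[0], t[-1], K + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        fa, fb = np.interp([a, b], t, y)
        feats.append((fb - fa) / (b - a))
    return np.array(feats)
```

With $J = 0$ the location loop is empty, reproducing the growth-data example where only two average slopes are kept.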
Further, given the data, $J$ and $K$ can be chosen large enough to reflect all relevant information about the functions. (Note that the computational load of the whole remaining procedure is linear in the dimension $J + K$.) If $J$ and $K$ are properly chosen, under certain standard distributional assumptions the transform preserves asymptotic Bayes optimality; see Section 5 below.
Naturally, only part of these intervals carries the information needed for separating the two classes. Delaigle et al. (2012) propose to determine a subset of points in $[0, 1]$ based on which the training classes are optimally separated. However, they do not provide a practical procedure to select these points; in applications they use cross-validation. Generally, we have no prior reason to weight the intervals differently. Therefore we use intervals of equal length, but possibly different ones for location and slope.
The question remains how many equally sized subintervals, $J$ for location and $K$ for slope, should be taken. We will see later in Section 7 that our classifier performs similarly with the three depths when the dimension is low. In higher dimensions the projection depth cannot be computed precisely enough, so that the classifier becomes worse. The performance is most influenced by the construction of the location-slope transform, that is, by the choice of the numbers $J$ and $K$. We postpone this question to Section 6.
3.2 The DD-transform
Denote the location-slope-transformed training classes in $\mathbb{R}^{J+K}$ by $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$. The DD-plot is then
$$DD(X, Y) = \left\{ \big(D(z|X), D(z|Y)\big) : z \in X \cup Y \right\}. \tag{8}$$
Here $D$ is a $(J+K)$-dimensional depth. In particular, $D$ may be $D_{Mah}$, $D_{spt}$ or $D_{prj}$, each of which does not produce outsiders. The DD-plots of these three for the growth data, taking $J = 0$ and $K = 2$, are pictured in Figure 3. Clearly, in this case, almost faultless separation is achievable by drawing a straight line through the origin.
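Constructing the DD-plot from the two transformed training classes amounts to evaluating one bivariate depth point per observation. A sketch (our illustration; the Mahalanobis depth is re-used here for brevity):

```python
import numpy as np

def mahal_depth(x, X):
    # empirical Mahalanobis depth, sample moments as estimates
    mu = X.mean(axis=0)
    d2 = float((x - mu) @ np.linalg.inv(np.cov(X, rowvar=False)) @ (x - mu))
    return 1.0 / (1.0 + d2)

def dd_plot(X0, X1):
    """Map every training point z to (depth(z | class 0), depth(z | class 1));
    the result lies in the unit square."""
    Z = np.vstack([X0, X1])
    return np.array([[mahal_depth(z, X0), mahal_depth(z, X1)] for z in Z])
```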
4 DD-plot classification
The training phase of our procedure consists in the following. After the training classes have been mapped from $\mathcal{F}$ to the DD-plot as described in Section 3, a separator is determined in the DD-plot that splits the transformed data. For the latter, we consider two classifiers operating on the DD-plot and compare their performance. Firstly, we propose the $k$NN classifier. It converges to the Bayes rule under standard elliptical models; see Section 4.1. Note that $k$ needs to be cross-validated on the DD-plot, which is computationally expensive. Therefore, secondly, we employ the $\alpha$-procedure, which is a very fast heuristic; see Section 4.2. Although its theoretical convergence is not guaranteed (see Lange et al., 2014a), the resulting classifier performs very well in applications.
4.1 $k$NN classification on the DD-plot
When applied to multivariate data, $k$NN is known (under mild assumptions) to be a consistent classifier. By measuring all distances in the Mahalanobis metric given by the inverse covariance matrix of the pooled data, an affine-invariant version is obtained. In our procedure we employ an affine-invariant $k$NN classifier on the DD-plot. It will be shown (Section 5) that, if the underlying distribution of each class is strictly unimodal elliptical with the same radial density, the $k$NN classifier, operating on the DD-plot, achieves asymptotically the optimal Bayes risk.
Given a function $f$ to be classified, we represent it (according to Section 3) as $z \in \mathbb{R}^{J+K}$ and then as $d_z = \big(D(z|X), D(z|Y)\big)$. According to $k$NN on the DD-plot, $f$ is classified as follows:
$$\mathrm{class}(f) = \arg\max_{g \in \{1, 2\}} \sum_{d \in N_k(d_z)} I\big(d \text{ belongs to class } g\big), \tag{9}$$
where $I(A)$ denotes the indicator function of a set $A$, and $N_k(d_z)$ is the neighborhood of $d_z$ defined by the $k$ closest training observations. We use the maximum distance on the DD-plot; thus $N_k(d_z)$ is the smallest rectangle centered at $d_z$ that contains $k$ training observations. In applications $k$ has to be chosen, usually by means of cross-validation.
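A sketch of rule (9) on the DD-plot (our illustration), using the maximum-coordinate distance so that the neighborhood is the smallest axis-parallel box around the query point that contains $k$ training points:

```python
import numpy as np

def knn_dd(dz, D, labels, k):
    """k-nearest-neighbor vote on the DD-plot. D holds one depth pair per
    training point, labels are 0/1, dz is the depth pair of the query."""
    dist = np.max(np.abs(D - dz), axis=1)   # L-infinity (maximum) distance
    nn = np.argsort(dist)[:k]               # indices of the k closest points
    return int(np.bincount(labels[nn], minlength=2).argmax())
```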
4.2 DD$\alpha$-classification
The second classification approach is the DD$\alpha$-classifier, introduced in Lange et al. (2014a). It uses a projective-invariant method called the $\alpha$-procedure (Vasil'ev & Lange, 1998), which is a heuristic classification procedure that iteratively decreases the empirical risk. The DD$\alpha$-classifier employs a modified version of the $\alpha$-procedure to construct a nonlinear separator on the DD-plot that classifies the depth-represented data points. The construction is based on the depth values and products of depth values (= D-features) up to some degree that can be either chosen a priori or determined by cross-validation. In a stepwise way, linear discrimination is performed in two-dimensional subspaces of the feature space. In each step a pair of features is replaced by a new feature, as long as the average misclassification rate decreases and basic features are left to be replaced. For details, we refer to Lange et al. (2014a).
We employ three depths (the Mahalanobis, spatial, and projection depths), which are positive on the whole $\mathbb{R}^d$ and thus do not produce outsiders. It is known that the DD$\alpha$-classifier is asymptotically Bayes-optimal for the location-shift model (see Lange et al., 2014a) and performs well for broad classes of simulated distributions and a wide variety of real data (see Lange et al., 2014b; Mozharovskyi et al., 2015). The main advantage of the DD$\alpha$-classifier is its high training speed: it contains the $\alpha$-procedure which, on the DD-plot, has quicksort complexity and proves to be very fast. The separating polynomial is constructed by space extensions of the DD-plot (which is of low dimension) and cross-validation.
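For intuition only, the stepwise idea can be caricatured in a few lines. The following sketch is NOT the $\alpha$-procedure of Vasil'ev & Lange (it ignores the space extensions, the construction of D-features and the projective invariance); it merely illustrates greedy risk-decreasing linear discrimination in two-dimensional planes, with all names and the angle grid chosen by us:

```python
import numpy as np

def alpha_like_separator(F, labels, n_angles=180):
    """Very simplified stepwise linear separator in feature space F (rows =
    observations). Starting from the best single feature, repeatedly try to
    absorb one more feature by scanning angles in the 2-D plane spanned by
    the current discriminant and the candidate; keep a combination only if
    the empirical risk strictly decreases. Classify by sign(F @ w)."""
    y = 2 * labels - 1                                   # recode labels to -1/+1
    def risk(score):
        return np.mean(np.sign(score) != y)              # empirical error rate
    p = F.shape[1]
    best = min((risk(s * F[:, j]), s, j) for j in range(p) for s in (1, -1))
    w = np.zeros(p)
    w[best[2]] = best[1]                                 # best single feature
    used = {best[2]}
    improved = True
    while improved and len(used) < p:
        improved = False
        cur = F @ w
        for j in set(range(p)) - used:
            for a in np.linspace(0, np.pi, n_angles, endpoint=False):
                cand = np.cos(a) * cur + np.sin(a) * F[:, j]
                if risk(cand) < risk(F @ w) - 1e-12:
                    w = np.cos(a) * w                    # rotate current weights
                    w[j] += np.sin(a)                    # absorb feature j
                    used.add(j)
                    improved = True
                    break
            if improved:
                break
    return w
```

The real procedure additionally synthesizes new features from products of depths and stops when no basic feature improves the risk.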
5 Some theoretical properties of the classifiers
In this section, we treat optimality properties of classification procedures that operate on a DD-plot of LS-transformed functional data. Besides $k$NN classification and the $\alpha$-procedure, they include linear and quadratic discriminant analysis as well as maximum-depth classification, all applied to the DD-plotted data. The latter procedures will be included in the comparative study of Section 7. First, in Subsection 5.1, it is shown that well-known properties regarding the Bayes optimality of classifiers for multivariate data carry over to LS-transformed functional data if the discretization points are fixed and $J + K$ is chosen large enough. Second, in Subsection 5.2, we introduce a sampling scheme that evaluates the functions at an increasing number of points, and establish Bayes optimality in the Gaussian case.
Let $\mathcal{F}$ be the space of real functions, defined on the finite interval $[0, 1]$, which are continuous and piecewise smooth. Consider two stochastic processes, $\xi$ and $\eta$, on a common probability space, whose paths are in $\mathcal{F}$ with probability 1. (For conditions on Gaussian processes that yield smooth paths, see e.g. Cambanis (1973).)
5.1 Fixed-discretization optimality
Let $\mathcal{T} = \{t_1, \dots, t_N\}$ be a given finite set of discretization points, and for each $f \in \mathcal{F}$ let $\tilde{f}$ be its linear interpolation based on $\mathcal{T}$, as described in Section 3. Denote the space of these linear interpolations by $\mathcal{F}_{\mathcal{T}}$. Then $\mathcal{F}_{\mathcal{T}}$ is a finite-dimensional linear space, which is isomorphic to $\mathbb{R}^N$. Consider two samples of independent random vectors in $\mathbb{R}^N$: $x_1, \dots, x_m$, distributed as $P_\xi$, and $y_1, \dots, y_n$, distributed as $P_\eta$, where $P_\xi$ and $P_\eta$ are the marginal distributions of $\xi$ resp. $\eta$ on $\mathcal{T}$, and let the two samples be independent of each other.
Proposition 1 (Bayes optimality on $\mathcal{F}_{\mathcal{T}}$)
Consider a class of decision rules on $\mathbb{R}^N$ and assume that it contains a sequence whose risk converges in probability to the Bayes risk. Then there exists a pair $(J, K)$ so that the same class of decision rules operating on LS-transformed data in $\mathbb{R}^{J+K}$ contains a sequence whose risk also converges in probability to the Bayes risk.
We say that a sequence of decision rules is Bayes optimal if their risk converges in probability to the Bayes risk. The proof of Proposition 1 is obvious: e.g., choose $J$ and $K$ so large that the LS-transform is injective on $\mathcal{F}_{\mathcal{T}}$. Then the Bayes optimality trivially carries over from $\mathbb{R}^N$ to the LS-transformed data, hence to $\mathcal{F}_{\mathcal{T}}$.
The Proposition allows for procedures that achieve error rates close to the minimum. However, note that it does not refer to Bayes optimality of classifying the underlying processes, but just of classifying the $N$-dimensional marginal distributions corresponding to $\mathcal{T}$. In particular, if the two processes have equal marginal distributions on $\mathcal{T}$, no discrimination is possible.
Corollary 1 (Optimality of LDA and QDA)
Let $\xi$ and $\eta$ be Gaussian processes, let the priors of class membership be equal, and let proper $J$ and $K$ be selected as above. The following rules operating on LS-transformed data are Bayes optimal, relative to the marginals:
(i) quadratic discriminant analysis (QDA),
(ii) linear discriminant analysis (LDA), if $\xi$ and $\eta$ have the same covariance function.
For a definition of LDA and QDA see, e.g., Devroye et al. (1996, Ch. 4.4). Proof: As the processes are Gaussian, the LS-transformed data are multivariate normal,
$$T(x_i) \sim N\big(\mu_\xi, \Sigma_\xi\big), \tag{10}$$
$$T(y_i) \sim N\big(\mu_\eta, \Sigma_\eta\big). \tag{11}$$
As by selecting $(J, K)$ as above the Bayes optimality is trivially preserved, the standard results of Fisher (see, again, Devroye et al. (1996)) apply; hence Corollary 1 holds.
The following result is taken from Lange et al. (2014a). Let $H$ be a hyperplane in $\mathbb{R}^d$, and let $\pi_H$ denote the projection on $H$. A probability distribution $Q$ in $\mathbb{R}^d$ is called the mirror image of a probability distribution $P$ regarding $H$ if for $Z \sim P$ it holds that $2\pi_H(Z) - Z$ is distributed as $Q$.
Proposition 2 (Lange et al. (2014a))
Let $P$ and $Q$ be probability distributions in $\mathbb{R}^d$ having densities $p$ and $q$, and let $H$ be a hyperplane such that $Q$ is the mirror image of $P$ with respect to $H$ and $p \geq q$ holds in one of the halfspaces generated by $H$. Then, based on a 50:50 independent sample from $P$ and $Q$, the $\alpha$-procedure will asymptotically yield the linear separator that corresponds to the bisecting line of the DD-plot.
Notice that, due to the mirror symmetry of the distributions, the DD-plot is symmetrically distributed as well. Its symmetry axis is the bisector, which is obviously the result of the $\alpha$-procedure when the sample is large enough. Then the rule corresponds to the Bayes rule. In particular, the requirements of the proposition are satisfied if $P$ and $Q$ are mirror symmetric and unimodal.
A stochastic process is called a strictly unimodal elliptical process if its finite-dimensional marginals are elliptical with the same radial density $r$, that is, have densities of the form $f(x) = c\, r\big((x - \mu)^\top \Sigma^{-1} (x - \mu)\big)$, and $r$ is strictly decreasing. Here, $\mu$ denotes the shift vector and $\Sigma$ the structural matrix.
Corollary 2 (Optimality of $k$NN- and DD$\alpha$-rules)
Assume that the processes $\xi$ and $\eta$ are strictly unimodal elliptical and have the same radial density, and choose $J$ and $K$ as above. Then

(a) the following rules operating on LS-transformed data are Bayes optimal, relative to the marginals:
(i) $k$NN on the DD-plot with the Mahalanobis depth,
(ii) $k$NN on the DD-plot with the spatial depth,
(iii) $k$NN on the DD-plot with the projection depth,
as $m, n \to \infty$ with $k \to \infty$ and $k / (m + n) \to 0$.

(b) If, in addition, the structural matrices coincide, $\Sigma_\xi = \Sigma_\eta$, and the priors of class membership are equal, then the following rules operating on LS-transformed data are Bayes optimal, relative to the marginals:
(iv) the maximum-depth rule,
(v) the DD$\alpha$-rule with the Mahalanobis depth,
(vi) the DD$\alpha$-rule with the spatial depth,
(vii) the DD$\alpha$-rule with the projection depth,
as $m, n \to \infty$.
Proof: Parts (i) to (iii) follow from Proposition 1 above and Theorem 3.5 in Vencálek (2011). Parts (v) to (vii) of Corollary 2 are a consequence of Propositions 1 and 2. Part (iv) is deduced from Proposition 1 and Ghosh & Chaudhuri (2005), who demonstrate that the maximum-depth rule is asymptotically equivalent to the Bayes rule if the two distributions have the same prior probabilities and are elliptical in $\mathbb{R}^d$ with only a location difference.
The above optimality results regard rules that operate on a DD-plot of LS-transformed data. For comparison, we will also apply the $k$NN rule directly to LS-transformed data.
Corollary 3 (Optimality of $k$NN)
The $k$NN classifier applied to LS-transformed data (where $J$ and $K$ are selected as above) is Bayes optimal, relative to the marginals.
5.2 Functional optimality
Next let us consider the asymptotic behavior of our procedure regarding the functions themselves. The above results refer to Bayes optimality with respect to the $N$-dimensional marginal distributions of the two processes that correspond to a fixed set of discretization points. Obviously, they allow no inference on the whole functions.
In order to classify asymptotically with respect to the processes, we introduce a sequential sampling scheme that evaluates the functions at an increasing number of discretization points. For this, a slightly different notation will be needed.
For each $s \in \mathbb{N}$ consider a set of discretization points, $\mathcal{T}_s = \{t^{(s)}_1, \dots, t^{(s)}_{N_s}\}$, such that $\mathcal{T}_s \subset \mathcal{T}_{s+1}$, $t^{(s)}_1 = 0$, $t^{(s)}_{N_s} = 1$ for all $s$, and
$$\max_{i} \big(t^{(s)}_{i+1} - t^{(s)}_i\big) \to 0 \quad \text{as } s \to \infty. \tag{12}$$
For the first class assume that $f_1, \dots, f_m$ are i.i.d. sampled paths from $\xi$, and that each $f_i$ is observed at the times $t^{(s)}_1, \dots, t^{(s)}_{N_s}$. Similarly assume that $g_1, \dots, g_n$ are i.i.d. from $\eta$, each being observed at the same times, and that the two samples are independent of each other.
In this setting, we apply our classification procedure with $J$ and $K$ depending on $s$. By cross-validation we select an optimal pair from the set
$$\mathcal{G}_s = \{(J, K) : J + K \leq N_s\}.$$
For $i = 1, \dots, m$, let $\tilde{f}^{(s)}_i$ be the linearly interpolated function that is supported at $\big(t^{(s)}_j, f_i(t^{(s)}_j)\big)$, $j = 1, \dots, N_s$, and define $\tilde{g}^{(s)}_i$ similarly. Let $\hat{P}_s$ denote the empirical distribution on $\{\tilde{f}^{(s)}_1, \dots, \tilde{f}^{(s)}_m\}$, and $\hat{Q}_s$ that on $\{\tilde{g}^{(s)}_1, \dots, \tilde{g}^{(s)}_n\}$. Then, obviously, the classification based on the LS-transformation, with $(J, K)$ being selected by cross-validation, is optimal regarding the empirical distributions $\hat{P}_s$ and $\hat{Q}_s$.
Now we can extend Corollary 1 to the case of process classification:
Theorem 1 (Bayes optimality with respect to Gaussian processes)
Let the two classes be generated by Gaussian processes that have piecewise smooth paths, and let the priors of class membership be equal. Under the above sequential sampling scheme, and if the dimension pair is selected from the candidate set, the following rules operating on the transformed data are almost surely Bayes optimal regarding the two processes:
(i) quadratic discriminant analysis (QDA),
(ii) linear discriminant analysis (LDA), if the two processes have the same covariance function.
Proof: As the processes are Gaussian, the marginal distributions at the discretization points are multivariate normal,
(13)
(14)
with the respective mean vectors and covariance matrices. Observe that the optimal (i.e. cross-validated) error rate converges to the probability of error. If the dimension pair is selected from the candidate set, then, as in Corollary 1, the Bayes risk is achieved with respect to the empirical distributions. Theorem 1(ii) of Nagy, Gijbels & Hlubinka (2015) says that under the above sampling scheme it holds
(15) 
As the Bayes risk is achieved with respect to the empirical distributions at every stage of the sampling scheme, (15) implies that, in the limit, the Bayes risk is almost surely achieved with respect to the two processes.
6 Choosing the dimensions of the location-slope space
Clearly, the performance of our classifier depends on the choice of the location and slope dimensions, which still has to be discussed. In what follows we assume that the data are given as functional values at a (usually large) number of discretization points.
Delaigle et al. (2012) propose to perform the classification in a finite-dimensional space that is based on a subset of discretization points selected to minimize the average error. These authors do not offer a construction rule for this task but rely on multi-layer leave-one-out cross-validation, which is very time-consuming. Having recognized this problem, they suggest some time-saving modifications of the cross-validation procedure. Clearly, the computational load is then determined by the cross-validation scheme used. It naturally depends on the size of the data sample, multiplied by the training time of the finite-dimensional classifier, which may itself include parameters to be cross-validated.
The approach of Delaigle et al. (2012) (crossDHB for short) suggests a straightforward procedure for our problem of constructing a location-slope space: allow for a rich initial set of possible dimension pairs and then use cross-validation to select a pair that (on average) performs best. The initial set shall consist of all pairs whose total dimension does not exceed some bound. Other upper bounds may be used, e.g. if the total dimension exceeds the number of observations. In addition, a dimension-reducing technique like PCA or factor analysis may be appropriate. But this cross-validation approach (crossLS for short), similar to the one of Delaigle et al. (2012), is still time-consuming; see Section 7.4 for computing times. The problem of selecting an appropriate pair remains challenging.
In deciding about the two dimensions, we will consider the observed performance of the classifier as well as some kind of worst-case performance. Fortunately, the conservative error bound of Vapnik and Chervonenkis (Vapnik & Chervonenkis, 1974), see also Devroye et al. (1996), provides some guidance. The main idea is to measure how good a certain location-slope space can be at all, by referring to the worst-case empirical risk of a linear classifier on the training classes. Clearly, the class of linear rules is limited, but it may be regarded as an approximation. Also, this limitation keeps the deviation from the empirical risk small and allows for its implicit comparison with the empirical risk itself. Moreover, in choosing the dimension of the location-slope space we may adjust for the required complexity, so that finally the separation rule is close to a linear rule.
Here, linear discriminant analysis (LDA) is used for the linear classification. Although other approaches like the perceptron, regression depth, the support vector machine, or the α-procedure can be employed instead, we use LDA as it is fast and linear. Below we derive the Vapnik-Chervonenkis conservative bound for our particular case; this is similar to Theorem 5.1 in Vapnik & Chervonenkis (1974). Here we use another bound on the cardinality of the class of separating rules (see also Devroye et al., 1996).
Consider the transformed training data as two samples of random vectors, distributed according to the two class distributions. Let a class of separation rules be given, where any two different rules separate the space into different subsets. Each rule then produces an expected classification error, which involves the prior probabilities of the two classes and an indicator function. As mentioned above, this error can be consistently estimated by leave-one-out cross-validation from the training samples. However, this estimate comes at substantial computational cost. On the other hand, based on the empirical risk, the expectation can be bounded from above. The empirical risk is given by
(17) 
As discussed before, we constrain the class to linear rules. Then the number of possible rules equals the number of different separations of the training points by hyperplanes, which in turn equals the number of different separations of points in a space of one more dimension that can be achieved by hyperplanes containing the origin. The cardinality of the class is thus bounded by a binomial sum, with equality attained if the points are in general position. Applying Hoeffding’s (1963) inequality, the probability that the largest (over the class of rules) excess of the classification error over the empirical risk exceeds some positive value can be bounded from above:
(18)  
This allows us to formulate the following proposition:
Proposition 3
For the class of linear separating rules, the following holds with prescribed probability:
(19) 
Proof: Equating the Hoeffding bound in (18) to the prescribed exceedance probability and solving for the threshold value gives the penalty term. Then, with the prescribed probability, equation (19) follows immediately.
The exceedance probability is fixed to some small constant, chosen in view of the sample size. The goodness of a certain choice of the location and slope dimensions is then measured by
(20) 
We restrict our choice of the dimensions to those that make this measure small. An intuitive justification is the following: the first term refers to the empirical risk, while the second term penalizes the dimension, which balances fit and complexity.
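The goodness measure can be sketched in Python as the empirical risk of a linear rule plus the penalty of (19). The count of linear dichotomies below uses Cover's formula for points in general position separated by hyperplanes through the origin (for affine hyperplanes in dimension d, apply it with d + 1); the confidence parameter `eta` and all names are illustrative, not the paper's notation.

```python
from math import comb, log, sqrt

def n_linear_dichotomies(n_points, dim):
    # Cover's count of dichotomies of n_points points in general position,
    # achievable by hyperplanes through the origin in R^dim:
    # 2 * sum_{i=0}^{dim-1} C(n_points - 1, i)
    return 2 * sum(comb(n_points - 1, i) for i in range(dim))

def vc_goodness(emp_risk, n_points, dim, eta=0.05):
    # Empirical risk plus a Vapnik-Chervonenkis penalty of the form
    # sqrt((ln |R| - ln eta) / (2 * n_points)), in the spirit of (19)-(20).
    n_rules = n_linear_dichotomies(n_points, dim)
    return emp_risk + sqrt((log(n_rules) - log(eta)) / (2 * n_points))
```

For a fixed dimension the penalty shrinks as the sample grows, while for a fixed sample it grows with the dimension, which is exactly the fit-complexity trade-off described above.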
To find out whether the proposed bound really helps in finding proper dimensions, we first apply our approach to two simulated data settings of Cuevas et al. (2007), called ‘Model 1’ and ‘Model 2’. The data generating process of Cuevas et al. (2007) is described in Section 7.2 below. We determine the dimensions, use the Mahalanobis depth to construct a DD-plot and apply the DD-plot based classifier. For each admissible dimension pair, the risk bound and the average classification error (ACE) are calculated by averaging over 100 takes and plotted in Figure 4. Note that these patterns look similar when the spatial or projection depth is used. Further, for the two benchmark data problems, we estimate the ACE by means of leave-one-out cross-validation (see Figure 5). Here all admissible dimension pairs are considered.
As expected, Figures 4 and 5 largely support the statement “the smaller the bound, the smaller the ACE”. Although for the challenging medflies data (Figure 5, right) the plot remains a fuzzy scatter, a substantial part of the points with small error have low values of the bound. (Even more, the pair with the smallest bound actually corresponds to the smallest error; see the point in the lower left corner.) Thus the bound still guides us in configuring the location-slope space, as is affirmed by our experimental results in Section 7.3 below. Computing the bound involves a single calculation of the LDA classifier (viz. to estimate the empirical risk), which is done in negligible time. Then the pair with the smallest bound can be picked. We restrict attention to a few pairs that deliver the smallest values of the bound, and choose the best one by cross-validating over this tiny range. This new technique is employed here for building the location-slope space. We abbreviate it as VCcrossLS.
7 Experimental comparisons
To evaluate the performance of the proposed methodology we compare it with several classifiers that operate either on the original data or in a locationslope space. After introducing those competitors (Section 7.1) we present a simulation study (Section 7.2) and a benchmark study (Section 7.3), including a discussion of computation loads (Section 7.4). Implementation details of the experiments are provided in Appendix A.
7.1 Competitors
The new classifiers are compared with four classification rules: linear (LDA) and quadratic (QDA) discriminant analysis, nearest-neighbours (kNN) classification, and maximum-depth classification (employing the three depth notions), all operating in a properly chosen location-slope space that is constructed with the bounding technique VCcrossLS of Section 6. (One may argue that the LDA-based choice of dimensions is not generally suited for every classifier, but it delivers comparable results in reasonable time, much faster than cross-validation over all pairs.) Also, the four classifiers mentioned above, together with the two new ones (with all three depths), are used when the dimension of the location-slope space is determined by non-restricted cross-validation (crossLS). For further comparison, all 12 classifiers are applied in the finite-dimensional space constructed according to the methodology crossDHB of Delaigle et al. (2012).
LDA and QDA are calculated with classical moment estimates, and the priors are estimated by the class portions in the training set. We include kNN among our competitors as it is Bayes-risk consistent in the finite-dimensional setting and generally performs very well in applications. The kNN classifier is applied to the location-slope data in its affine invariant form. It is then defined as in (9), but with the Mahalanobis distance (determined from the pooled data) in place of the Euclidean distance. The number of neighbours is selected by cross-validation.
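A minimal sketch of such an affine-invariant kNN rule, using the Mahalanobis distance derived from the pooled covariance. This is a plain illustration under those assumptions, not the paper's implementation; all names are invented.

```python
import numpy as np

def knn_mahalanobis(x, train, labels, k=5):
    """kNN with the Mahalanobis distance of the pooled training data,
    which makes the rule affine invariant. labels: non-negative ints."""
    cov_inv = np.linalg.inv(np.cov(train, rowvar=False))
    preds = []
    for row in np.atleast_2d(x):
        diff = train - row
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances
        nearest = labels[np.argsort(d2)[:k]]
        preds.append(np.bincount(nearest).argmax())          # majority vote
    return np.array(preds)
```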
As further competitors we consider three maximum depth classifiers. They are defined as
(21) 
with the depth being either the Mahalanobis, the spatial, or the projection depth, weighted by the prior probability of the respective class. The priors are estimated by the class portions in the training set. This classifier is Bayes optimal if the data come from an elliptical location-shift model with known priors. For technical and implementation details the reader is referred to Appendix A.
7.2 Simulation settings and results
Next we explore our methodology by applying it to the simulation setting of Cuevas et al. (2007). Their data are generated from two models, each having two classes. The first model is Model 1:
where the driving noise is a Gaussian process and the functions are discretized at 51 equally distant points of the domain; see Figure 6 (left) for an illustration. The functions are smooth and differ in mean only, which makes the classification task rather simple.
We take 100 observations (50 from each class) for training and 100 (50 from each class) for evaluating the performance of the classifiers. Training and classification are repeated 100 times to get stable results. Figure 7 (left) presents boxplots (over the 100 takes) of the error rates of the twelve classifiers applied after transforming the data to properly constructed finite-dimensional spaces. The top panel refers to a location-slope space whose dimensions are selected by Vapnik-Chervonenkis restricted cross-validation (VCcrossLS), the middle panel to a location-slope space whose dimensions are determined by mere, unrestricted cross-validation (crossLS), and the bottom panel to the finite-dimensional argument subspace constructed by the componentwise method crossDHB of Delaigle et al. (2012). The classifiers are: linear (LDA) and quadratic (QDA) discriminant analysis, the nearest neighbours classifier (kNN), the maximum depth classifier with Mahalanobis (MDM), spatial (MDS) and projection (MDP) depth, the DD-plot classifier with a distance-based kNN rule (M, S, P), and the DDα-classifier (M, S, P), both with the three depths, correspondingly. The last approach (crossDHB) has not been combined with the projection depth for two reasons. First, performing the necessary cross-validations with the projection depth becomes computationally infeasible; for computational times see Table 14 in Appendix B. Second, the quality of approximation of the projection depth differs between the tries, and this instability is possibly misleading when choosing the optimal argument subspace, thus yielding rather high error rates; compare, e.g., the classification errors for the growth data, Table 1 in Section 7.3.
As expected, all DD-plot based classifiers, applied in a properly chosen location-slope space, show highly satisfactory performance, which is in line with the best result of Cuevas et al. (2007). The location-slope spaces that have been selected for the different classifiers, viz. the corresponding dimension pairs, do not differ much; see Table 3 in Appendix B. The picture remains the same when the location-slope space is chosen using unrestricted cross-validation (crossLS), which yields no substantial improvement.
On the other hand, classifiers operating in an optimal argument subspace (crossDHB) are outperformed by those employed in the location-slope space (crossLS, VCcrossLS), although their error rates are still low; see Figure 7 (left). A plausible explanation could be that the differences between the classes at each single argument point are not significant enough, so that a bunch of points has to be taken for reasonable discrimination. But the sequential character of the procedure (as discussed in the Introduction and in Appendix A) prevents it from choosing higher dimensions. Most often dimensions two or three are chosen; see Table 11 in Appendix B. Our procedure appears to do better here since, by integrating, more information is extracted from the functions, so that the classes become distinguishable.
Next we consider an example where averaging turns out to be rather a disadvantage. Model 2 of Cuevas et al. (2007) looks as follows:
with the noise process as before; the second mean function is an 8-knot spline approximation of the first. See Figure 6 (right) for an illustration.
The corresponding boxplots of the error rates are depicted in Figure 7 (right). The results for the individual classifiers differ. When the location-slope space is chosen using the Vapnik-Chervonenkis bound (VCcrossLS), LDA and all maximum depth classifiers perform poorly, while the DD-plot based classifiers with Mahalanobis and spatial depth perform better. The remaining classifiers perform comparably. The latter lack efficiency when employed with the projection depth; as seen from Table 4 in Appendix B, the efficient location-slope spaces are of substantially higher dimension (8 and more). Thus, the larger error rates with the projection depth are explained by the insufficient number of random directions used in approximating this depth. Also, with the projection depth, different from the Mahalanobis and spatial depth, less efficient pairs are preferred; see Tables 4 and 8 in Appendix B.
Choosing the location-slope space by unrestricted cross-validation (crossLS) does not change much. The error rates obtained with this location-slope space are larger than those obtained with the synthesized space (crossDHB), although with QDA and the DD-plot based classifiers they stay reasonable. In Model 2, taking several extreme points would be enough for distinguishing the classes, and the synthesized finite-dimensional spaces most often have dimension four, and sometimes three. (Note that all DD-plot based classifiers regard these dimensions as sufficient and, together with QDA, deliver the best results.) On the other hand, the classifiers operating in a location-slope space select efficient dimensions of 8 and higher, which is also seen from Tables 4 and 8 in Appendix B.
7.3 Comparisons on benchmark data
Now we come back to the two benchmark problems given in the Introduction. The growth data have already been analyzed by several authors. López-Pintado & Romo (2006) achieve a best classification error of 16.13 % when classifying with the distance from the trimmed mean, where trimming is based on the generalized band depth. Cuesta-Albertos & Nieto-Reyes (2010) use an average distance weighting with the random Tukey depth and get a classification error of 13.68 %. Cuevas et al. (2007) obtain a mean error rate of 9.04 % when using the double random projection depth and 4.04 % with another depth-based rule, dividing the sample into 70 training and 23 testing observations over 70 tries. Cuesta-Albertos et al. (2014) achieve 3.5 % when employing the LDA classifier on a DD-plot that is based on the mode depth applied to both location and slope. Sguera et al. (2014) get an error rate of 3.45 % when classifying with the kernelized functional spatial depth and choosing the kernel parameter by cross-validation.
Table 1: Classification errors (in %), estimated by leave-one-out cross-validation.

                       growth data                     medflies data
Classifier   VCcrossLS   crossLS   crossDHB     VCcrossLS   crossLS
LDA             4.3        4.3       3.23         39.33      41.01
QDA             4.3        5.38      7.53         38.58      42.7
kNN             3.23       3.23      4.3          41.39      44.76
MDM             5.38       5.38      5.38         38.95      40.64
MDS             7.53       7.53      7.53         38.2       38.01
MDP             8.6        7.53      9.68         44.01      43.82
M               5.38       5.38      5.38         43.26      41.39
S               4.3        4.3       7.53         39.33      41.95
P               5.38       3.23      8.6          42.32      41.57
M               5.38       5.38      3.23         37.83      38.58
S               5.38       5.38      6.45         38.95      40.26
P               6.45       6.45      6.45         35.02      35.96
Table 1 (columns growth data) provides errors, estimated by leave-one-out cross-validation, of the different classification techniques. In this, a proper location-slope space is selected either based on the Vapnik-Chervonenkis bound (column VCcrossLS) or on unrestricted cross-validation (crossLS), or an optimal argument subset is chosen by the componentwise technique of Delaigle et al. (2012) (crossDHB). Note that with VCcrossLS classification by kNN is best. It achieves an error rate of 3.23 %, which means here that only three observations are misclassified. This coincides with the best cross-validation result of Baíllo & Cuevas (2008). Both DD-plot based classifiers perform well with all three depths, while the maximum depth classifiers MDS and MDP perform worse.
The Vapnik-Chervonenkis restricted cross-validation VCcrossLS seems to perform not much worse than the unrestricted cross-validation crossLS, and it mostly outperforms the componentwise approach crossDHB. The latter is particularly bad when the projection depth is involved. Tables 5, 9 and 13 in Appendix B exhibit how often the various dimension pairs, and the dimensions of the optimal argument subspace, are chosen.
In general, all three space-constructing techniques allow for very low error rates, producing as few as three misclassifications if the classifier is properly chosen. The DDα-classifiers yield at most six misclassifications.
Compared to López-Pintado & Romo (2006) and Cuesta-Albertos & Nieto-Reyes (2010), the better performance of our classifiers appears to be due to the inclusion of average slopes. Observe that the acceleration period starting at approximately nine years discriminates particularly well between girls and boys; see Figure 2. Note that the componentwise method also prefers (location) values from this interval.
A much more involved real task is the classification of the medflies data. In Müller & Stadtmüller (2005) these data are analyzed by generalized functional linear models. The authors employ logit regression and semi-parametric quasi-likelihood regression. They get errors of 41.76 % and 41.2 %, respectively, also estimated by leave-one-out cross-validation.
We apply all classifiers to the same data in properly constructed location-slope spaces. With our procedure we are able to improve the discrimination between long- and short-lived flies. In particular, with the classifier based on the projection depth an error of 35.02 % is obtained (see Table 1, columns captioned medflies data, for the errors). The role of the derivatives in building the location-slope space is emphasized in Tables 6 and 10 in Appendix B, which show the frequencies at which the various dimension pairs are selected. crossLS is outperformed in most of the cases. We were not able to include the componentwise approach crossDHB in the comparison as its computational load is too high. On average, the DD-plot based classifiers perform very satisfactorily. LDA and QDA with the Vapnik-Chervonenkis bound, as well as the maximum-depth classifiers with Mahalanobis and spatial depth (MDM, MDS), also deliver reasonable errors.
Note that, in configuring the location-slope space with crossLS, lower errors could be obtained by using finer (e.g. leave-one-out) cross-validations. To make the componentwise classification feasible and the comparison fair, we have used only 10-fold cross-validation in all procedures besides the choice of the number of neighbours. (The number of neighbours in kNN and in the DD-plot based classifier has been determined by leave-one-out cross-validation.) For exact implementation details the reader is referred to Appendix A.
7.4 Computational loads
Most methods of functional classification tend to be time-consuming because of their need for preprocessing, smoothing and parameter tuning, but the literature on such methods usually does not discuss computation times. Nevertheless this is an important practical issue. Our procedure turns out to be particularly efficient for three main reasons. First, any cross-validation is restricted to very few iterations. Second, the depth space, where the final classification is done, has low dimension, which equals the number of classes. Third, the linear interpolation requires no preprocessing or smoothing of the data.
To illustrate this we briefly present the durations of both training and classification phases for the two real data sets and the twelve classification techniques in Table 2. (The classification time of a single object is reported in parentheses.) As the times depend on the implementation and platform used, we also report (in square brackets) the numbers of cross-validations done, as this measure is independent of the eventual classification technique. The training times have been obtained as the average over all runs of the leave-one-out cross-validation (thus using 92, respectively 533, observations for the growth and medflies data). This comes very close to the time needed to train with the entire data set, as the difference of one observation is negligible. The classification times in Table 2 have been obtained in the same way, i.e. by averaging the classification of each single observation over all runs of the leave-one-out cross-validation. The same holds for the number of cross-validating iterations. For the componentwise classifiers (crossDHB) the averages have been replaced by medians for the following reason. The sequential character of the procedure causes an exponential increase of time with each iteration (in some range; see Appendix A for implementation details). Therefore, occasionally the total computation time can be outlying. (In our study, once the time exceeded four hours, viz. when classifying the growth data with the DD-plot based classifier and the projection depth, which required 20 450 iterations to cross-validate.) On the other hand, when employing faster classifiers (which usually require stronger assumptions on the data) the training phase can take less than two minutes. (This has been pointed out by Delaigle et al. (2012) as well.)
Table 2: Training times in seconds (classification time of a single object in parentheses, number of cross-validations in square brackets).

                       growth data                        medflies data
Classifier   VCcrossLS    crossLS     crossDHB     VCcrossLS    crossLS
LDA             2.73        12.42      150.78          9.87       29.36
             (0.002)     (0.0019)    (0.002)       (0.0021)    (0.0023)
             [5.39]      [150]       [3110]        [5]         [150]
QDA             2.73        12.23       48.45          9.77       29.12
             (0.0017)    (0.0017)    (0.0019)      (0.0016)    (0.0027)
             [5.39]      [150]       [1020]        [5]         [150]
kNN             2.91        26.38      213.36         23.87      814.46
             (0.001)     (0.001)     (0.0011)      (0.0024)    (0.0028)
             [5.39]      [150]       [2576]        [5]         [150]
MDM             2.49         2.77       65.82          9.46        9.86
             (0.0009)    (0.0009)    (0.0004)      (0.0007)    (0.0007)
             [5.39]      [150]       [6920]        [5]         [150]
MDS             2.63         5.47      168.14         10.04       35.89
             (0.0016)    (0.0017)    (0.0017)      (0.0017)    (0.0018)
             [5.39]      [150]       [6940]        [5]         [150]
MDP             4.51        78.32     1795.86         19.86      439.03
             (0.0398)    (0.0481)    (0.0392)      (0.2247)    (0.2233)
             [5.39]      [150]       [4699]        [5]         [150]
M               3.06        17.31      299.3          20.53      335.48
             (0.0012)    (0.0011)    (0.0012)      (0.002)     (0.002)
             [5.39]      [150]       [2856]        [5]         [150]
S               3.61        36.31      631.97         24.13      551.36
             (0.0022)    (0.0021)    (0.0019)      (0.003)     (0.0033)
             [5.39]      [150]       [3135]        [5]         [150]
P               6.65       143.42     3103.57         41.56     1145.16
             (0.0308)    (0.0308)    (0.03)        (0.1995)    (0.1987)
             [5.39]      [150]       [4115]        [5]         [150]
M               3.42        24.7       182.03         14.98      174.56
             (0.0009)    (0.0009)    (0.0011)      (0.001)     (0.001)
             [5.39]      [150]       [1020]        [5]         [150]
S               4           42.49      860.14         18.71      392.61
             (0.0018)    (0.0017)    (0.0017)      (0.0018)    (0.0019)
             [5.39]      [150]       [3135]        [5]         [150]
P               7.02       154.24     2598.38         36.04      983.61
             (0.0306)    (0.0309)    (0.0298)      (0.2)       (0.1995)
             [5.39]      [150]       [3135]        [5]         [150]
With the growth data, training times are substantially higher when a dimension pair is chosen by unrestricted cross-validation than when the Vapnik-Chervonenkis bound is employed. However, for the fast maximum depth classifiers (with Mahalanobis or spatial depth) computation times appear quite comparable. Application of crossDHB causes an enormous increase in time (as well as in the number of cross-validations needed). For the medflies data, as expected, VCcrossLS is faster than crossLS. We were not able to perform the leave-one-out cross-validation estimation for the componentwise method on this data set, because of its excessive computational load.
See also Table 14 for the same experiment regarding the simulated data. There, the projection-depth based classifiers have been omitted altogether, as they need too much computation time.
8 Conclusions
An efficient nonparametric procedure has been introduced for the binary classification of functional data. The procedure consists of a two-step transformation of the original data plus a classifier operating on the unit square. The functional data are first mapped into a finite-dimensional location-slope space and then transformed by a multivariate depth function into the DD-plot, which is a subset of the unit square. Three alternative depth functions are employed for this, as well as two rules for the final classification on the DD-plot.
Our procedure either outperforms or matches existing approaches on simulated as well as on real benchmark data sets. The results of the DD-plot based classifiers are generally good, although (cf. Model 2) they are sometimes slightly outperformed by the componentwise classification method of Delaigle et al. (2012).
As the raw data are linearly interpolated, no spurious information is added. The core of our procedure is the new data-dependent construction of the location-slope space. We bound its dimension by a Vapnik-Chervonenkis bound. The subsequent depth transformation into the unit square makes the procedure rather robust, since the final classification is done on a low-dimensional compact set.
Our use of statistical data depth functions demonstrates the variety of their applications and opens new prospects when considering the proposed location-slope space. To reflect the dynamic structure of functional data, the construction of this space takes, in a natural way, levels together with derivatives into account. As has been shown, efficient information extraction is done via piecewise averaging of the functional data in its raw form, while the variation of the functions with their argument is reflected by their average slopes.
The finite-dimensional space has to be constructed in a way that respects the important intervals and includes most of the information. Here, equally spaced intervals are used that cover the entire domain but may differ in number for location and slope. This gives sufficient freedom in configuring the location-slope space. Note that, in view of the simulations as well as the benchmark results, choosing a particular depth function is of limited relevance only. While, depending on the given data, different intervals are differently relevant, location and slope may differ in information content as well. The set of reasonable location-slope spaces is enormously reduced by application of the Vapnik-Chervonenkis bound, and the selection is done by fast cross-validation over a very small set. The obtained finite-dimensional space can be augmented by coordinates reflecting additional information on the data that may be available. While our procedure is designed for the analysis of functions that are piecewise smooth, it can also be used for functions that are just piecewise continuous (by simply fixing the slope dimension at 0). Obviously, higher order derivatives can be involved, too. But obtaining those requires smooth extrapolation, which entails additional computations and possibly produces spurious information.
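The location-slope construction can be sketched as follows: equal-length subintervals of the domain, average levels of the linearly interpolated path, and average slopes. The function below is an illustration under these assumptions (the average level is approximated by the mean over a fine grid), not the packaged implementation; all names are invented.

```python
import numpy as np

def location_slope(t_obs, y_obs, q, r, n_grid=1000):
    """Map one discretized function to a (q + r)-dimensional vector:
    q average levels and r average slopes over equal-length subintervals,
    using the linearly interpolated path throughout."""
    a, b = t_obs[0], t_obs[-1]
    levels = []
    for j in range(q):
        lo = a + (b - a) * j / q
        hi = a + (b - a) * (j + 1) / q
        grid = np.linspace(lo, hi, n_grid)
        # mean of the interpolant over a fine grid approximates the
        # integral average of the level on [lo, hi]
        levels.append(np.interp(grid, t_obs, y_obs).mean())
    slopes = []
    for j in range(r):
        lo = a + (b - a) * j / r
        hi = a + (b - a) * (j + 1) / r
        # the average slope over [lo, hi] is exactly the difference quotient
        f_lo, f_hi = np.interp([lo, hi], t_obs, y_obs)
        slopes.append((f_hi - f_lo) / (hi - lo))
    return np.array(levels + slopes)
```

Collecting these vectors over all observations yields the finite-dimensional sample on which the depth transformation operates.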
The new approach is presented here for two classes, but it is not limited to this case. With more than two classes, kNN is applied in the higher-dimensional depth space without changes, and the α-procedure based classifier is extended by either constructing one-against-all or pairwise classifiers in the depth space and finally performing some aggregation in the classification phase (see also Lange et al., 2014a).
Our space selection technique (incomplete cross-validation, restricted by a Vapnik-Chervonenkis bound) has been compared, both in terms of error rates and computational time, with full cross-validation as well as with the componentwise space synthesis method of Delaigle et al. (2012). We have done this for all variants of the classifiers.
In future research, comparisons with other existing functional classification techniques, as well as the use of other finite-dimensional classifiers on the DD-plot, are needed. Refined data-dependent procedures, which size the relevant intervals and leave out irrelevant ones, may be developed to configure the location-slope space. However, such refinements will possibly conflict with the efficiency and generality of the present approach.
Acknowledgement
We thank Dominik Liebl for his critical comments on an earlier version of the manuscript, as well as Ondrej Vencalek and Aurore Delaigle for their helpful remarks. The reading and suggestions of two referees are also gratefully acknowledged.
Appendix A Implementation details
In calculating the depths, the parameters of the Mahalanobis depth have been determined by the usual moment estimates, and similarly for the spatial depth. The projection depth has been approximated by drawing 1 000 directions from the uniform distribution on the unit sphere. Clearly, the number of directions needed for a satisfactory approximation depends on the dimension of the space. Observe that for higher-dimensional problems 1 000 directions are not enough, which becomes apparent from the analysis of Model 2 in Section 7.2. There the location-slope spaces chosen have dimension eight and higher; see also Tables 4 and 8 in Appendix B. On the other hand, calculating the projection depth even in one dimension comes at a cost. Computing 1 000 directions to approximate the projection depth takes substantially more time than computing the exact Mahalanobis or spatial depths (see Tables 2 and 14 in Appendix B).
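The random-direction approximation of the projection depth can be sketched as follows, using the common median/MAD definition of outlyingness; the number of directions and all names are illustrative, not those of the ddalpha package.

```python
import numpy as np

def projection_depth(x, sample, n_dir=1000, rng=None):
    """Approximate projection depth 1 / (1 + outlyingness), where the
    outlyingness is maximised over n_dir random unit directions."""
    rng = np.random.default_rng(rng)
    d = sample.shape[1]
    u = rng.normal(size=(n_dir, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)     # directions on the sphere
    proj_s = sample @ u.T                             # sample projections
    med = np.median(proj_s, axis=0)
    mad = np.median(np.abs(proj_s - med), axis=0)     # per-direction MAD scale
    proj_x = np.atleast_2d(x) @ u.T
    out = np.max(np.abs(proj_x - med) / mad, axis=1)  # worst-case outlyingness
    return 1.0 / (1.0 + out)
```

The supremum over all directions is only approached from below, which is exactly why too few directions understate the outlyingness in high dimensions.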
LDA and QDA are used with classical moment estimates, and the priors are estimated by the class portions in the training set. The kNN classifier is applied to the location-slope data in its affine invariant form, based on the covariance matrix of the pooled classes. For time reasons, its number of neighbours is determined by leave-one-out cross-validation over a reduced range. The α-procedure separating the DD-plot uses polynomial space extensions with maximum degree three; the latter is selected by cross-validation. To keep the training speed of the depth-based kNN classifier comparable with that of the plain kNN classifier, we also determine its number of neighbours by leave-one-out cross-validation on a reduced range.
Due to the linear interpolation, the levels are integrated as piecewise-linear functions, and the derivatives as piecewise constant ones. If the dimension of the location-slope space is too large (in particular for inverting the covariance matrix, as can be the case in Model 2), PCA is used to reduce the dimension. Then the covariance matrix is estimated and all further computations are performed in the subspace of principal components having positive loadings.
To construct the location-slope space, first all dimension pairs are considered whose total dimension does not exceed a given bound (amounting to 26 for the synthetic and to 16 for the real data sets). For each pair the data are transformed into the corresponding location-slope space, and the Vapnik-Chervonenkis bound is calculated. Then those five pairs are selected that have the smallest bounds. Here, tied values of the bound are taken into account as well, with the consequence that on average slightly more than five pairs are selected; see the growth data in Table 2 and both synthetic models in Table 14 of Appendix B. Finally, among these the best pair is chosen by means of cross-validation. Note that the goal of this cross-validation is not to actually choose a best location-slope dimension but rather to get rid of obviously misleading pairs, which may also yield relatively small values of the bound. This is seen from Figures 4 and 5. When determining an optimal pair by crossLS, the same set of pairs is considered as with VCcrossLS.
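The pair-selection step just described (keep the pairs with the smallest bound, ties included, then cross-validate over this tiny set) can be sketched as follows; the dictionary interface is an illustrative assumption, not the paper's code.

```python
def select_candidate_pairs(bounds, n_keep=5):
    """Given a dict {(location_dim, slope_dim): bound value}, keep the
    n_keep pairs with the smallest Vapnik-Chervonenkis bound, ties
    included, to be settled by a final cross-validation over this set."""
    vals = sorted(bounds.values())
    cutoff = vals[min(n_keep, len(vals)) - 1]   # value of the n_keep-th smallest
    return [pair for pair, v in bounds.items() if v <= cutoff]
```

Because ties at the cutoff are retained, slightly more than `n_keep` pairs may survive, matching the behaviour reported for the growth data.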
In implementing the componentwise method of finite-dimensional space synthesis (cross-DHB), we have followed Delaigle et al. (2012) with slight modifications: their original approach is combined with the sequential approach of Ferraty et al. (2010). Initially, a grid of equally distanced discretization points is built. Then a sequence of finite-dimensional spaces is synthesized by adding points of the grid step by step. We start with all pairs of discretization points that lie at least a minimum distance apart. (Note that Delaigle et al. (2012) start with single points instead of pairs.) The best of these pairs is chosen by cross-validation. Then, step by step, further features are added: in each step, the point having the best discrimination power (again, in the sense of cross-validation) when added to the already constructed set is chosen as a new feature. The resulting set of points is used to construct a neighborhood of combinations to be considered further. As a neighborhood we use twenty distanced points in the second step and ten in the third; from the fourth step on, only the sequential approach is applied.
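A stripped-down sketch of this greedy point selection follows (Python for illustration). `cv_error` is a placeholder for the cross-validated classification error on the chosen evaluation points, and the neighborhood refinement of the second and third steps is omitted:

```python
from itertools import combinations

def greedy_points(grid, cv_error, n_points, min_gap):
    """Greedy componentwise selection of discretization points."""
    # start with the best pair of points at least `min_gap` apart
    # (Delaigle et al. (2012) start with single points instead)
    pairs = [p for p in combinations(grid, 2) if abs(p[1] - p[0]) >= min_gap]
    selected = list(min(pairs, key=cv_error))
    # then add, one at a time, the point whose inclusion yields the
    # smallest cross-validated error on the enlarged set
    while len(selected) < n_points:
        rest = [p for p in grid if p not in selected]
        selected.append(min(rest, key=lambda p: cv_error(tuple(selected) + (p,))))
    return selected
```

With a toy criterion that rewards covering three "informative" locations of the curves, the procedure recovers exactly those locations.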
All our cross-validations are ten-fold, except the leave-one-out cross-validations used to determine the parameter of both classifiers. Of course, partitioning the sample into only ten parts may put our approach at a disadvantage against a more comprehensive leave-one-out cross-validation. We have chosen it to keep the computation times of the cross-DHB approach (Delaigle et al., 2012) within practical limits, and also to make the comparison of approaches equitable throughout our study.
The calculations have been implemented in an R environment, based on the R package “ddalpha” (Mozharovskyi et al., 2015), with speed-critical parts written in C++. The R code implementing our methodology, as well as that performing the experiments, can be obtained from the authors upon request. In all experiments, a single core of an Intel Core i7-2600 processor (3.4 GHz) with sufficient physical memory has been used. Thus, regarding the methodology of Delaigle et al. (2012), our implementation differs from the original one and, due to its module-based structure, may result in larger computation times. For this reason we report the number of cross-validations performed; see Tables 2 and 14 of Appendix B. The comparison appears to be fair, as we always use ten-fold cross-validation together with an identical set of classification rules in the finite-dimensional spaces.
Appendix B Additional tables
                        Max.depth
        LDA  QDA        Mah.  Spt.  Prj.   Mah.  Spt.  Prj.   Mah.  Spt.  Prj.
(2,1)    65   64   55     58    63    54     65    63    47     49    51    50
(3,1)    15   23   29     34    23    29     16    24    34     29    26    31
(2,0)    14    7    7      3     9    11      7     3     5     11    12    10
(2,2)     1    2    5      0     1     3      4     2     6      3     1     6
(3,0)     2    1    4      1     1     1      2     1     3      4     5     3
Others    3    3    0      4     3     2      6     7     5      4     5     0
                        Max.depth
        LDA  QDA