Multivariate and functional classification
using depth and distance
Abstract
We construct classifiers for multivariate and functional
data.
Our approach is based on a kind of distance between data
points and classes.
The distance measure needs to be robust to outliers and
invariant to linear transformations of the data.
For this purpose we can use the bagdistance which
is based on halfspace depth.
It satisfies most of the properties of a norm but is able
to reflect asymmetry when the class is skewed.
Alternatively we can compute a measure of outlyingness
based on the skew-adjusted projection depth.
In either case we propose the DistSpace transform
which maps each data point to the vector of its distances
to all classes, followed by k-nearest neighbor (kNN)
classification of the transformed data points.
This combines invariance and robustness with the simplicity
and wide applicability of kNN.
The proposal is compared with other methods in experiments
with real and simulated data.
1 Introduction
Supervised classification of multivariate data is a common statistical problem. One is given a training set of observations and their membership in certain groups (classes). Based on this information, one must assign new observations to these groups. Examples of classification rules include, but are not limited to, linear and quadratic discriminant analysis, $k$-nearest neighbors (kNN), support vector machines, and decision trees. For an overview see e.g. Hastie et al. (2009).
However, real data often contain outlying observations. Outliers can be caused by recording errors or typing mistakes, but they may also be valid observations that were sampled from a different population. Moreover, in supervised classification some observations in the training set may have been mislabeled, i.e. attributed to the wrong group. To reduce the potential effect of outliers on the data analysis and to detect them, many robust methods have been developed, see e.g. Rousseeuw and Leroy (1987) and Maronna et al. (2006).
Many of the classical and robust classification methods rely on distributional assumptions such as multivariate normality or elliptical symmetry (Hubert and Van Driessen, 2004). Most robust approaches that can deal with more general data make use of the concept of depth, which measures the centrality of a point relative to a multivariate sample. The first type of depth was the halfspace depth of Tukey (1975), followed by other depth functions such as simplicial depth (Liu, 1990) and projection depth (Zuo and Serfling, 2000).
Several authors have used depth in the context of classification. Christmann and Rousseeuw (2001) and Christmann et al. (2002) applied regression depth (Rousseeuw and Hubert, 1999). The maximum depth classification rule of Liu (1990) was studied by Ghosh and Chaudhuri (2005) and extended by Li et al. (2012). Dutta and Ghosh (2011) used projection depth.
In this paper we will present a novel technique called classification in distance space. It aims to provide a fully nonparametric tool for the robust supervised classification of possibly skewed multivariate data. In Sections 2 and 3 we will describe the key concepts needed for our construction. Section 4 discusses some existing multivariate classifiers and introduces our approach. A thorough simulation study for multivariate data is performed in Section 5. From Section 6 onwards we will focus our attention on the increasingly important framework of functional data, the analysis of which is a rapidly growing field. We will start with a general description, and then extend our work on multivariate classifiers to functional classifiers.
2 Multivariate depth and distance measures
2.1 Halfspace depth
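If $Y$ is a random variable on $\mathbb{R}^p$ with distribution $P_Y$, then the halfspace depth of any point $x \in \mathbb{R}^p$ relative to $P_Y$ is defined as the minimal probability mass contained in a closed halfspace with boundary through $x$:
$$\mathrm{HD}(x; P_Y) = \inf_{\|v\|=1} P_Y\{ v'Y \ge v'x \}.$$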
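As a rough illustration of the finite-sample version of this definition, the following Python sketch approximates the halfspace depth of a bivariate point by scanning a regular grid of directions. The function name `halfspace_depth_2d`, the number of directions and the tolerance are assumptions of this sketch, not part of the paper; a finite grid can only overestimate the true depth, and the exact algorithms cited in this section should be preferred in practice.

```python
import math

def halfspace_depth_2d(x, data, n_dirs=360, tol=1e-12):
    """Approximate empirical halfspace depth of x in a bivariate sample.

    For each direction v on a regular grid of angles (an assumption of this
    sketch), count the fraction of data points y with v'y >= v'x, and keep
    the smallest fraction found.
    """
    n = len(data)
    depth = 1.0
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        v = (math.cos(angle), math.sin(angle))
        proj_x = v[0] * x[0] + v[1] * x[1]
        count = sum(1 for y in data if v[0] * y[0] + v[1] * y[1] >= proj_x - tol)
        depth = min(depth, count / n)
    return depth

# Toy sample: four points on the axes plus the origin.
sample = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]
center_depth = halfspace_depth_2d((0, 0), sample)   # deepest point of the sample
outside_depth = halfspace_depth_2d((5, 5), sample)  # outside the convex hull
```

In this toy sample the origin keeps three of the five points in every closed halfspace through it, whereas a point outside the convex hull gets depth zero.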
Halfspace depth satisfies the requirements of a statistical depth function as formulated by Zuo and Serfling (2000): it is affine invariant (i.e. invariant to translations and nonsingular linear transformations), it attains its maximum value at the center of symmetry if there is one, it is monotone decreasing along rays emanating from the center, and it vanishes at infinity.
For any statistical depth function $D$ and for any $\alpha \in [0,1]$ the $\alpha$-depth region $D_\alpha$ is the set of points whose depth is at least $\alpha$:
$$D_\alpha = \{ x \in \mathbb{R}^p \,;\, D(x; P_Y) \ge \alpha \}. \tag{1}$$
The boundary of $D_\alpha$ is known as the $\alpha$-depth contour. The halfspace depth regions are closed, convex, and nested for increasing $\alpha$. Several properties of the halfspace depth function and its contours were studied in Massé and Theodorescu (1994) and Rousseeuw and Ruts (1999). The halfspace median (or Tukey median) is defined as the center of gravity of the smallest nonempty depth region, i.e. the region containing the points with maximal halfspace depth.
The finite-sample definitions of the halfspace depth, the Tukey median and the depth regions are obtained by replacing $P_Y$ by the empirical probability distribution $P_n$. Many finite-sample properties, including the breakdown value of the Tukey median, were derived in Donoho and Gasko (1992).
To compute the halfspace depth, several affine invariant algorithms have been developed. Rousseeuw and Ruts (1996) and Rousseeuw and Struyf (1998) provided exact algorithms in two and three dimensions and an approximate algorithm in higher dimensions. Recently Dyckerhoff and Mozharovskyi (2016) developed exact algorithms in higher dimensions. Algorithms to compute the halfspace median have been developed by Rousseeuw and Ruts (1998) and Struyf and Rousseeuw (2000). To compute the depth contours the algorithm of Ruts and Rousseeuw (1996) can be used in the bivariate setting, whereas the algorithms constructed by Hallin et al. (2010) and Paindaveine and Šiman (2012) are also applicable in higher dimensions.
2.2 The bagplot
The bagplot of Rousseeuw et al. (1999) generalizes the univariate boxplot to bivariate data, as illustrated in Figure 1. The dark-colored bag is the smallest depth region with at least 50% probability mass, i.e. $B = D_{\tilde\alpha}$ such that $P_Y(B) \ge 0.5$ and $P_Y(D_\alpha) < 0.5$ for all $\alpha > \tilde\alpha$. The white region inside the bag is the smallest depth region, which contains the halfspace median (plotted as a red diamond). The fence, which itself is rarely drawn, is obtained by inflating the bag by a factor of 3 relative to the median, and the data points outside of it are flagged as outliers and plotted as stars. The light-colored loop is the convex hull of the data points inside the fence.
The bagplot exposes several features of the bivariate data distribution: its center (by the Tukey median), its dispersion and shape (through the sizes and shape of the bag and the loop) and the presence or absence of outliers. In Figure 1 we see a moderate deviation from symmetry as well as several observations that lie outside the fence. One could extend the notion of bagplot to higher dimensions as well, but a graphical representation then becomes harder or impossible.
2.3 The bagdistance
Although the halfspace depth is small in outliers, it does not tell us how distant they are from the center of the data. Also note that any point outside the convex hull of the data has zero halfspace depth, which is not so informative. Based on the concept of halfspace depth, we can however derive a statistical distance of a multivariate point $x \in \mathbb{R}^p$ to $P_Y$ as in Hubert et al. (2015). This distance uses both the center and the dispersion of $P_Y$. To account for the dispersion it uses the bag $B$ defined above. Next, $c_x$ is defined as the intersection of the boundary of $B$ and the ray from the halfspace median $\theta$ through $x$. The bagdistance of $x$ to $Y$ is then given by the ratio of the Euclidean distance of $x$ to $\theta$ and the Euclidean distance of $c_x$ to $\theta$:
$$bd(x; P_Y) = \begin{cases} 0 & \text{if } x = \theta \\ \|x - \theta\| \,/\, \|c_x - \theta\| & \text{elsewhere.} \end{cases} \tag{2}$$
The denominator in (2) accounts for the dispersion of $P_Y$ in the direction of $x$. Note that the bagdistance does not assume symmetry and is affine invariant.
The finite-sample definition is similar and illustrated in Figure 2 for the data set in Figure 1. Now the bag is shown in gray. For two new points $x_1$ and $x_2$ their Euclidean distance to the halfspace median is marked by dark blue lines, whereas the orange lines correspond to the denominator of (2) and reflect how these distances will be scaled. Here the lengths of the blue lines are the same (although they look different as the scales of the coordinate axes are quite distinct). On the other hand the bagdistance of $x_1$ is 7.43 and that of $x_2$ is only 0.62. These values reflect the position of the points relative to the sample, one lying quite far from the most central half of the data and the other one lying well within the central half.
Similarly we can compute the bagdistance of the outliers. For the uppermost right outlier we obtain 4.21, whereas the bagdistance of the lower outlier is 3.18. Both distances are larger than 3, the bagdistance of all points on the fence, but the bagdistance now reflects the fact that the lower outlier is merely a boundary case. The upper outlier is more distant, but still not as remote as $x_1$.
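We will now provide some properties of the bagdistance. We define a generalized norm as a function $g: \mathbb{R}^p \to [0, \infty[$ such that $g(0) = 0$ and $g(x) \ne 0$ for $x \ne 0$, which satisfies $g(\gamma x) = \gamma g(x)$ for all $x$ and all $\gamma > 0$. In particular, for a positive definite $p \times p$ matrix $\Sigma$ it holds that
$$g(x) = \sqrt{x' \Sigma^{-1} x} \tag{3}$$
is a generalized norm (and even a norm).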
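Now suppose we have a compact set $B$ which is star-shaped about zero, i.e. for all $x \in B$ and $0 \le \gamma \le 1$ it holds that $\gamma x \in B$. For every $x \ne 0$ we then construct the point $c_x$ as the intersection between the boundary of $B$ and the ray emanating from $0$ in the direction of $x$. Let us assume that $0$ is in the interior of $B$, that is, there exists $\varepsilon > 0$ such that the ball $B(0, \varepsilon) \subset B$. Then $\|c_x\| > 0$ whenever $x \ne 0$. Now define
$$g(x) = \begin{cases} 0 & \text{if } x = 0 \\ \|x\| \,/\, \|c_x\| & \text{otherwise.} \end{cases} \tag{4}$$
Note that we do not really need the Euclidean norm, as we can equivalently define $g(x)$ as $\inf\{\gamma > 0 \,;\, \gamma^{-1} x \in B\}$. We can verify that $g(\cdot)$ is a generalized norm, which need not be a continuous function. The following result shows more.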
Theorem 1.
If the set $B$ is convex and compact and $0 \in \mathrm{int}(B)$ then the function $g$ defined in (4) is a convex function and hence continuous.
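Proof. We need to show that
$$g(\lambda x + (1-\lambda) y) \le \lambda g(x) + (1-\lambda) g(y) \tag{5}$$
for any $x, y \in \mathbb{R}^p$ and $0 \le \lambda \le 1$. In case $\{0, x, y\}$ are collinear the function $g$ restricted to this line is 0 in the origin and goes up linearly in both directions (possibly with different slopes) so (5) is satisfied for those $x$ and $y$. If $\{0, x, y\}$ are not collinear they form a triangle. Note that we can write $x = g(x) c_x$ and $y = g(y) c_y$ and we will denote $z := \lambda x + (1-\lambda) y$. We can verify that $\tilde z := (\lambda g(x) + (1-\lambda) g(y))^{-1} z$ is a convex combination of $c_x$ and $c_y$. By compactness of $B$ we know that $c_x, c_y \in B$, and from convexity of $B$ it then follows that $\tilde z \in B$. Therefore
$$\|c_z\| = \|c_{\tilde z}\| \ge \|\tilde z\|$$
so that finally
$$g(z) = \frac{\|z\|}{\|c_z\|} \le \frac{\|z\|}{\|\tilde z\|} = \lambda g(x) + (1-\lambda) g(y).$$
Note that this result generalizes Theorem 2 of Hubert et al. (2015) from halfspace depth to general convex sets. It follows that $g$ satisfies the triangle inequality since
$$g(x + y) = 2\, g\!\left(\tfrac{1}{2} x + \tfrac{1}{2} y\right) \le 2 \left( \tfrac{1}{2} g(x) + \tfrac{1}{2} g(y) \right) = g(x) + g(y).$$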
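Therefore $g$ (and thus the bagdistance) satisfies the conditions
(i) $g(x) \ge 0$ for all $x \in \mathbb{R}^p$
(ii) $g(x) = 0$ implies $x = 0$
(iii) $g(\gamma x) = \gamma g(x)$ for all $x \in \mathbb{R}^p$ and $\gamma \ge 0$
(iv) $g(x + y) \le g(x) + g(y)$ for all $x, y \in \mathbb{R}^p$.
This is almost a norm; in fact, it would become a norm if we were to add $g(-x) = g(x)$.
The generalization makes it possible for $g$ to reflect asymmetric dispersion. (We could easily turn it into a norm by computing $h(x) = (g(x) + g(-x))/2$ but then we would lose that ability.)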
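Also note that the function $g$ defined in (4) does generalize the Mahalanobis distance in (3), as can be seen by taking $B = \{x \,;\, x' \Sigma^{-1} x \le 1\}$ which implies $c_x = (x' \Sigma^{-1} x)^{-1/2}\, x$ for all $x \ne 0$ so
$$g(x) = \frac{\|x\|}{(x' \Sigma^{-1} x)^{-1/2}\, \|x\|} = \sqrt{x' \Sigma^{-1} x}.$$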
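To make the construction in (4) concrete, here is a small Python sketch that evaluates the generalized norm through its equivalent form $\inf\{\gamma > 0 \,;\, \gamma^{-1} x \in B\}$ by bisection. The helper `gauge`, the membership tests `in_ellipse` and `in_box`, and the chosen sets are hypothetical illustrations; the sketch assumes that the set is bounded, star-shaped about zero, and contains zero in its interior.

```python
import math

def gauge(x, inside, n_iter=60):
    """g(x) = inf{gamma > 0 : x / gamma in B} for a compact star-shaped set B.

    `inside` is a membership test for B (a hypothetical helper of this
    sketch). B is assumed bounded with the origin in its interior.
    """
    if all(c == 0 for c in x):
        return 0.0
    hi = 1.0
    while not inside([c / hi for c in x]):   # grow until x / hi lies in B
        hi *= 2.0
    lo = 0.0
    for _ in range(n_iter):                  # bisection on gamma
        mid = 0.5 * (lo + hi)
        if inside([c / mid for c in x]):
            hi = mid
        else:
            lo = mid
    return hi

# Ellipse {z : z' Sigma^{-1} z <= 1} with Sigma = diag(4, 1): Mahalanobis case.
def in_ellipse(z):
    return z[0] ** 2 / 4.0 + z[1] ** 2 <= 1.0

# Convex box that is not symmetric about the origin: asymmetric dispersion.
def in_box(z):
    return -1.0 <= z[0] <= 2.0 and abs(z[1]) <= 1.0

g_ellipse = gauge((2.0, 2.0), in_ellipse)   # should match sqrt(x' Sigma^{-1} x)
g_right = gauge((2.0, 0.0), in_box)
g_left = gauge((-2.0, 0.0), in_box)
```

For the ellipse the result agrees with the Mahalanobis distance $\sqrt{5}$ up to the bisection tolerance, while for the box the points $(2, 0)$ and $(-2, 0)$ receive different values, which is the asymmetry that a norm cannot express.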
Finally note that Theorem 1 holds whenever $B$ is a convex set. Instead of halfspace depth we could also use regions of projection depth or the depth function in Section 3. On the other hand, if we wanted to describe nonconvex data regions we would have to switch to a different star-shaped set $B$ in (4).
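In the univariate case, the compact convex set $B$ in Theorem 1 becomes a closed interval which we can denote by $B = [-1/b, 1/a]$ with $a, b > 0$, so that
$$g(x) = a\, x^{+} + b\, x^{-}.$$
In linear regression the minimization of $\sum_i g(r_i)$, where the $r_i$ are the residuals, yields the $a/(a+b)$ regression quantile of Koenker and Bassett (1978).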
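The link with quantiles can be checked numerically in the simplest, intercept-only case. The sketch below is an illustration under that simplifying assumption (a location fit instead of a full linear regression); the names `gauge_interval` and `best_location` are hypothetical. Since the objective is piecewise linear and convex, searching over the data values is enough to find a minimizer.

```python
def gauge_interval(x, a, b):
    """g(x) = a * x^+ + b * x^- for the interval B = [-1/b, 1/a], a, b > 0."""
    return a * x if x >= 0 else -b * x

def best_location(values, a, b):
    """Data value c minimizing sum_i g(y_i - c), by brute force."""
    return min(values, key=lambda c: sum(gauge_interval(y - c, a, b) for y in values))

values = list(range(1, 10))                  # the sample 1, 2, ..., 9
location = best_location(values, a=3, b=1)   # a / (a + b) = 0.75
```

With $a = 3$ and $b = 1$ the minimizer is the value 7, which is the 0.75 quantile of the sample $1, \ldots, 9$.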
It is straightforward to extend Theorem 1 to a nonzero center by subtracting the center first.
To compute the bagdistance $bd(x; P_n)$ of a point $x$ with respect to a $p$-variate sample we can first compute the bag $B$ and then the intersection point $c_x$. In low dimensions computing the bag is feasible, and it is worth the effort if the bagdistance needs to be computed for many points. In higher dimensions computing the bag is harder, and then a simpler and faster algorithm is to search for the multivariate point $c^*$ on the ray from $\theta$ through $x$ such that
$$\mathrm{HD}(c^*; P_n) = \operatorname*{med}_{i} \{ \mathrm{HD}(y_i; P_n) \} \tag{6}$$
where $y_i$ are the data points. Since HD is monotone decreasing on the ray this can be done fairly fast, e.g. by means of the bisection algorithm.
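The bisection search based on (6) can be sketched as follows for bivariate data. This is an illustration under several assumptions that are not from the paper: the halfspace depth is approximated on a grid of directions (`hd`), the Tukey median is replaced by the average of the deepest data points, and the depths of all data points are recomputed on every call for the sake of brevity.

```python
import math
from statistics import median

def hd(x, data, n_dirs=180, tol=1e-12):
    """Approximate bivariate halfspace depth on a grid of directions."""
    best = 1.0
    for j in range(n_dirs):
        a = 2.0 * math.pi * j / n_dirs
        vx, vy = math.cos(a), math.sin(a)
        px = vx * x[0] + vy * x[1]
        count = sum(1 for y in data if vx * y[0] + vy * y[1] >= px - tol)
        best = min(best, count / len(data))
    return best

def bagdistance(x, data, n_iter=50):
    """Bagdistance of x via bisection along the ray from the center through x."""
    depths = [hd(y, data) for y in data]
    target = median(depths)                      # right-hand side of (6)
    top = max(depths)
    deepest = [y for y, d in zip(data, depths) if d == top]
    theta = (sum(y[0] for y in deepest) / len(deepest),
             sum(y[1] for y in deepest) / len(deepest))  # crude median proxy
    dx, dy = x[0] - theta[0], x[1] - theta[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0
    ux, uy = dx / dist, dy / dist
    lo = 0.0
    # Far enough along the ray to lie outside the convex hull (depth zero).
    hi = 2.0 * max(math.hypot(y[0] - theta[0], y[1] - theta[1]) for y in data)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if hd((theta[0] + mid * ux, theta[1] + mid * uy), data) >= target:
            lo = mid                             # still inside the bag
        else:
            hi = mid
    return dist / lo

grid = [(i, j) for i in range(-2, 3) for j in range(-2, 3)]
bd_near = bagdistance((1.0, 0.5), grid)
bd_far = bagdistance((2.0, 1.0), grid)
```

Because the point $c^*$ depends only on the direction of the ray, doubling the distance of a point from the center doubles its bagdistance, which the two calls above illustrate.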
Table 1 lists the computation time needed to calculate the bagdistance of $m \in \{1, 50, 100, 1000\}$ points with respect to a sample of $n$ points in dimensions $p \in \{2, 3, 4, 5\}$. For $p = 2$ the algorithm of Ruts and Rousseeuw (1996) is used and (6) otherwise. The times are averages over 1000 randomly generated data sets. In each of the 1000 runs the points were generated from a centered multivariate normal distribution with a randomly generated covariance matrix. Note that the time for $m = 1$ is essentially that of the right hand side of (6).

Table 1. Computation times of the bagdistance (rows: dimension p; columns: number of points m).

p    m = 1    m = 50    m = 100    m = 1000
2    15.6     16.2      17.4       17.1
3    34.8     67.8      84.1       310.2
4    45.3     88.3      107.9      377.3
5    56.4     106.3     128.2      432.8
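3 Skew-adjusted projection depth
Since the introduction of halfspace depth various other affine invariant depth functions have been defined (for an overview see e.g. Mosler (2013)), among which projection depth (Zuo, 2003), which is essentially the inverse of the Stahel-Donoho outlyingness (SDO). The population SDO (Stahel, 1981; Donoho, 1982) of an arbitrary point $x$ with respect to a random variable $Y$ with distribution $P_Y$ is defined as
$$\mathrm{SDO}(x; P_Y) = \sup_{\|v\|=1} \frac{|v'x - \mathrm{med}(v'Y)|}{\mathrm{MAD}(v'Y)}$$
from which the projection depth is derived:
$$\mathrm{PD}(x; P_Y) = \frac{1}{1 + \mathrm{SDO}(x; P_Y)}.$$
Since the SDO has an absolute deviation in the numerator and uses the MAD in its denominator it is best suited for symmetric distributions. For asymmetric distributions Brys et al. (2005) proposed the adjusted outlyingness (AO) in the context of robust independent component analysis. It is defined as
$$\mathrm{AO}(x; P_Y) = \sup_{\|v\|=1} \mathrm{AO}_1(v'x; P_{v'Y})$$
where the univariate adjusted outlyingness $\mathrm{AO}_1$ is given by
$$\mathrm{AO}_1(z; P_Z) = \begin{cases} \dfrac{z - \mathrm{med}(Z)}{w_2(Z) - \mathrm{med}(Z)} & \text{if } z > \mathrm{med}(Z) \\[2ex] \dfrac{\mathrm{med}(Z) - z}{\mathrm{med}(Z) - w_1(Z)} & \text{if } z \le \mathrm{med}(Z). \end{cases} \tag{7}$$
Here
$$w_1(Z) = Q_1(Z) - 1.5\, e^{-4\,\mathrm{MC}(Z)}\, \mathrm{IQR}(Z), \qquad w_2(Z) = Q_3(Z) + 1.5\, e^{+3\,\mathrm{MC}(Z)}\, \mathrm{IQR}(Z)$$
if $\mathrm{MC}(Z) \ge 0$, where $Q_1(Z)$ and $Q_3(Z)$ denote the first and third quartile of $P_Z$, $\mathrm{IQR}(Z) = Q_3(Z) - Q_1(Z)$, and $\mathrm{MC}(Z)$ is a robust measure of skewness (Brys et al., 2004). If $\mathrm{MC}(Z) < 0$ we replace $(z, Z)$ by $(-z, -Z)$. The denominator of (7) corresponds to the fence of the univariate adjusted boxplot proposed by Hubert and Vandervieren (2008).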
The skew-adjusted projection depth (SPD) is then given by (Hubert et al., 2015):
$$\mathrm{SPD}(x; P_Y) = \frac{1}{1 + \mathrm{AO}(x; P_Y)}.$$
To compute the finite-sample SPD we have to rely on approximate algorithms, as it is infeasible to consider all directions $v$. A convenient affine invariant procedure is obtained by considering directions $v$ which are orthogonal to an affine hyperplane through $p$ randomly drawn data points. In our implementation the number of directions is chosen in advance. Table 2 shows the time needed to compute the AO (or SPD) of $m \in \{1, 50, 100, 1000\}$ points with respect to a sample of $n$ points in dimensions $p \in \{2, 3, 4, 5\}$, as in Table 1. Here the time for $m = 1$ is the fixed cost of computing those directions and projecting the original data on them.
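The univariate building block of the adjusted outlyingness can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used for Table 2: the medcouple is computed by a naive quadratic-time loop without the special treatment of ties at the median, the quartiles follow the default rule of `statistics.quantiles`, and the function names are hypothetical.

```python
import math
from statistics import median, quantiles

def medcouple(z):
    """Naive O(n^2) medcouple; ties at the median get no special treatment."""
    m = median(z)
    upper = [v for v in z if v >= m]
    lower = [v for v in z if v <= m]
    h = [((zi - m) - (m - zj)) / (zi - zj)
         for zi in upper for zj in lower if zi != zj]
    return median(h)

def adjusted_outlyingness_1d(x, z):
    """Univariate AO of x with respect to the sample z."""
    mc = medcouple(z)
    if mc < 0:                       # reflect so that the skewness is >= 0
        return adjusted_outlyingness_1d(-x, [-v for v in z])
    m = median(z)
    q1, _, q3 = quantiles(z, n=4)
    iqr = q3 - q1
    w1 = q1 - 1.5 * math.exp(-4.0 * mc) * iqr   # lower fence
    w2 = q3 + 1.5 * math.exp(3.0 * mc) * iqr    # upper fence
    if x > m:
        return (x - m) / (w2 - m)
    return (m - x) / (m - w1)

symmetric = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
skewed = [0, 1, 2, 4, 8, 16, 32]
ao_right = adjusted_outlyingness_1d(5, symmetric)
ao_left = adjusted_outlyingness_1d(-5, symmetric)
mc_skewed = medcouple(skewed)
```

For the symmetric sample the medcouple is zero, so both fences lie at the same distance from the median and the two points receive the same outlyingness; for the right-skewed sample the medcouple is positive, which pushes the upper fence outward.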
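Table 2. Computation times of the AO or SPD (rows: dimension p; columns: number of points m).

p    m = 1    m = 50    m = 100    m = 1000
2    15.0     15.3      15.6       20.9
3    23.2     23.9      23.5       31.3
4    30.5     30.9      31.6       41.7
5    38.4     39.1      40.0       52.2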
4 Multivariate classifiers
4.1 Existing methods
One of the oldest nonparametric classifiers is the $k$-nearest neighbor (kNN) method introduced by Fix and Hodges (1951). For each new observation the method looks up the $k$ training data points closest to it (typically in Euclidean distance), and then assigns it to the most prevalent group among those neighbors. The value of $k$ is typically chosen by cross-validation to minimize the misclassification rate.
Liu (1990) proposed to assign a new observation to the group in which it has the highest depth. This MaxDepth rule is simple and can be applied to more than two groups. On the other hand it often yields ties when the depth function is identically zero on large domains, as is the case with halfspace depth and simplicial depth. Dutta and Ghosh (2011) avoided this problem by using projection depth instead, whereas Hubert and Van der Veeken (2010) employed the skew-adjusted projection depth.
To improve on the MaxDepth rule, Li et al. (2012) introduced the DepthDepth classifier as follows. Assume that there are two groups, and denote the empirical distributions of the training groups as $P_1$ and $P_2$. Then transform any data point $x \in \mathbb{R}^p$ to the bivariate point
$$(\mathrm{depth}(x; P_1),\ \mathrm{depth}(x; P_2)) \tag{8}$$
where depth is a statistical depth function. These bivariate points form the so-called depth-depth plot, in which the two groups of training points are colored differently. The classification is then performed on this plot. The MaxDepth rule corresponds to separating according to the 45 degree line through the origin, but in general Li et al. (2012) calculate the best separating polynomial. Next, they assign a new observation to group 1 if it lands above the polynomial, and to group 2 otherwise. Some disadvantages of the depth-depth rule are the computational complexity of finding the best separating polynomial and the need for majority voting when there are more than two groups. Other authors carry out a depth transform followed by linear classification (Lange et al., 2014) or kNN (Cuesta-Albertos et al., 2015) instead.
4.2 Classification in distance space
It has been our experience that distances can be very useful in classification, but we prefer not to give up the affine invariance that depth enjoys. Therefore, we propose to use the bagdistance of Section 2.3 for this purpose, or alternatively the adjusted outlyingness of Section 3. Both are affine invariant, robust against outliers in the training data, and suitable also for skewed data.
Suppose that $G$ groups (classes) are given, where $G \ge 2$. Let $P_g$ represent the empirical distribution of the training data from group $g = 1, \ldots, G$. Instead of the depth transform (8) we now carry out a distance transform by mapping each point $x \in \mathbb{R}^p$ to the $G$-variate point
$$(\mathrm{dist}(x; P_1), \ldots, \mathrm{dist}(x; P_G)) \tag{9}$$
where $\mathrm{dist}(x; P_g)$ is a generalized distance or an outlyingness measure of the point $x$ to the $g$th training sample. Note that the dimension $G$ may be lower than, equal to, or higher than the original dimension $p$. After the distance transform any multivariate classifier may be applied, such as linear or quadratic discriminant analysis. The simplest version is of course MinDist, which just assigns $x$ to the group with smallest coordinate in (9). When using the Stahel-Donoho or the adjusted outlyingness, this is equivalent to the MaxDepth rule based on projection depth or skew-adjusted projection depth. However, we prefer to apply kNN to the transformed points. This combines the simplicity and robustness of kNN with the affine invariance offered by the transformation. Also note that we never need to resort to majority voting. In the simulations in Section 5 we will see that the proposed DistSpace method (i.e. the distance transform (9) followed by kNN) works quite well.
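The two steps of DistSpace, the distance transform (9) followed by kNN, can be sketched as below for bivariate data. The sketch makes several assumptions for the sake of a short, self-contained example: the distance is a Stahel-Donoho outlyingness approximated on a grid of directions with a raw MAD (no consistency factor), training points are transformed with respect to samples that include themselves, and $k$ is fixed rather than cross-validated. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median

def sdo_2d(x, sample, n_dirs=90):
    """Approximate Stahel-Donoho outlyingness on a grid of directions."""
    out = 0.0
    for j in range(n_dirs):
        a = math.pi * j / n_dirs
        vx, vy = math.cos(a), math.sin(a)
        proj = [vx * y[0] + vy * y[1] for y in sample]
        med = median(proj)
        mad = median([abs(p - med) for p in proj])   # raw MAD
        out = max(out, abs(vx * x[0] + vy * x[1] - med) / mad)
    return out

def dist_transform(x, groups, dist=sdo_2d):
    """Map x to the vector of its distances to all groups, as in (9)."""
    return [dist(x, g) for g in groups]

def distspace_knn(x, groups, k=3, dist=sdo_2d):
    """Classify x by kNN (Euclidean) among the transformed training points."""
    train = [(dist_transform(y, groups, dist), label)
             for label, g in enumerate(groups) for y in g]
    target = dist_transform(x, groups, dist)
    nearest = sorted(train, key=lambda item: math.dist(item[0], target))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Two toy groups: 3 x 3 grids centered at (0, 0) and (10, 10).
group_a = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
group_b = [(i + 10, j + 10) for i in (-1, 0, 1) for j in (-1, 0, 1)]
label_a = distspace_knn((0.5, 0.2), [group_a, group_b])
label_b = distspace_knn((9.5, 10.1), [group_a, group_b])
```

Swapping `sdo_2d` for a bagdistance or an adjusted outlyingness only changes the `dist` argument; the kNN step is unaffected, which is the modularity the method relies on.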
We now illustrate the distance transform on a real world example, available from the UCI Machine Learning Repository (Bache and Lichman, 2013). The data originated from an authentication procedure of banknotes. Photographs of 762 genuine and 610 forged banknotes were processed using wavelet transformations, and four features were extracted. These are the 4 coordinates shown in the scatterplot matrix in Figure 3.
Note that here $G = 2$ while $p = 4$. Using the bagdistance, the distance space of this data is shown in Figure 4. It shows that forged and authentic banknotes are well-separated and that the authentic banknotes form a tight cluster compared to that of the forged ones. Any new banknote would yield a new point in this plot, allowing kNN to classify it.
5 Computational results
To evaluate the various classifiers we apply them to simulated and real data. Their performance is measured by their average misclassification percentage $\sum_{g=1}^{G} e_g n_g / N$ with $e_g$ the percentage of misclassified observations of group $g$ in the test set, $n_g$ the number of observations of group $g$ in the training set, and $N$ the total size of the training set. This weights the misclassification percentages in the test set according to the prior probabilities. In each scenario the test set consists of 500 observations per group. This procedure is repeated 2000 times for each setting.
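As a small worked example of this performance measure, the following Python function computes the weighted misclassification percentage from per-group test errors and training group sizes. The function name and the numbers are illustrative only.

```python
def weighted_error(test_errors, train_sizes):
    """Average misclassification percentage sum_g e_g * n_g / N.

    test_errors[g] is the percentage of misclassified test observations of
    group g, and train_sizes[g] is the size of group g in the training set.
    """
    total = sum(train_sizes)
    return sum(e * n for e, n in zip(test_errors, train_sizes)) / total

# Two groups with 10% and 20% test error and training sizes 150 and 100.
overall = weighted_error([10.0, 20.0], [150, 100])
```

Here the larger training group has the smaller error, so the overall figure of 14% lies closer to 10% than to 20%.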
Setting 1: Trivariate normals ($G = 3$, $p = 3$). We generate data from three different normal distributions. The first group $C_1$ follows a trivariate normal distribution with a given mean vector and covariance matrix.
The second group is generated like $C_1$ but we flip the sign of the second coordinate. The third group is again generated like $C_1$ but then shifted by a fixed vector. The training data consist of 50 observations in each group.
Setting 2: Multivariate normal and skewed ($G = 2$, $p = 6$). We consider two 6-variate distributions. The first group $C_1$ is drawn from the standard normal distribution. The coordinates in the second group $C_2$ are independent draws from the exponential distribution with rate parameter 1:
$$C_2 \sim (\mathrm{Exp}(1), \ldots, \mathrm{Exp}(1))'.$$
The training data has 150 observations drawn from group $C_1$ and 100 from $C_2$.
Setting 3: Concentric distributions ($G = 2$). This consists of two groups of data. The first group $C_1$ is drawn from the standard normal distribution. The second group $C_2$ is obtained by generating points on the unit sphere in $\mathbb{R}^p$ and multiplying them by lengths which are generated uniformly on a fixed interval. The training data has 150 observations from group $C_1$ and 250 from $C_2$.
Setting 4: Banknote authentication data ($G = 2$, $p = 4$). We first standardize the data by the column-wise median and MAD. The training sets are random subsets of 500 points from the original data set, with the test sets each time consisting of the remaining 872 observations.
Among the depth-based classification rules, halfspace depth (HD) is compared to projection depth (PD) and skew-adjusted projection depth (SPD). We run the MaxDepth rule, DepthDepth followed by the best separating polynomial and DepthDepth followed by kNN. The degree of the polynomial and the number of neighbors $k$ are selected based on leave-one-out cross-validation.
Among the distance-based classifiers, the bagdistance based on halfspace depth (bd) is compared to the Stahel-Donoho outlyingness (SDO) and the adjusted outlyingness (AO). Here the MinDist and DistSpace classifiers are considered.
We evaluate all classifiers on the uncontaminated data, and on data where 5% and 10% of the observations in each group are mislabeled by assigning them randomly to another group. Figures 5-8 summarize the results with boxplots of the misclassification percentages.
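The mislabeling scheme can be sketched as follows. This is one possible reading of the description above, with assumptions labeled as such: the number of mislabeled observations per group is the rounded fraction of the group size, the affected observations are drawn without replacement, and the new label is drawn uniformly among the other groups. The function name and the seed handling are hypothetical.

```python
import random

def mislabel(labels, fraction, n_groups, seed=0):
    """Return a copy of labels in which a fraction of each group is mislabeled."""
    rng = random.Random(seed)
    noisy = list(labels)
    for g in range(n_groups):
        members = [i for i, lab in enumerate(labels) if lab == g]
        n_flip = round(fraction * len(members))
        for i in rng.sample(members, n_flip):
            # Assign the observation at random to one of the other groups.
            noisy[i] = rng.choice([h for h in range(n_groups) if h != g])
    return noisy

labels = [0] * 100 + [1] * 100 + [2] * 100
noisy = mislabel(labels, 0.05, n_groups=3)
changed = [sum(1 for i in range(300) if labels[i] == g and noisy[i] != g)
           for g in range(3)]
```

With 100 observations per group and a fraction of 5%, exactly five labels change in each group.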
In setting 1, most of the depth- and distance-based methods did better than kNN. The halfspace depth HD did not perform well in MaxDepth and DepthDepth + kNN, and in fact mislabeling improved the classification because it yielded fewer points with depth zero in both groups. Halfspace depth appeared to work better in DepthDepth + polynomial but this is due to the fact that whenever a point has depth zero in both groups, Li et al. (2012) fall back on kNN in the original data space. Also note that DepthDepth + polynomial by construction improves the MaxDepth rule on training data, but it does not always perform better on test data.
In setting 2 we note the same things about HD in the depth-based methods. The best results are obtained by DepthDepth + poly and DistSpace, where we note that the methods that are able to reflect skewness (HD, SPD, bd, AO) did a lot better than those that are not (PD, SDO). This is because the data contains a skewed group.
In the third setting one of the groups is not convex at all, and the MaxDepth and MinDist boxplots lie entirely above the figure. On the other hand the DepthDepth and DistSpace methods still see structure in the data, and yield better results than kNN on the original data.
In the banknote authentication example (setting 4), all methods except HD work well. For clean data, the two methods using the bagdistance outperform all others.
6 Functional data
The analysis of functional data is a booming research area of statistics, see e.g. the books of Ramsay and Silverman (2005) and Ferraty and Vieu (2006). A functional data set typically consists of $n$ curves observed at time points $t_1, \ldots, t_T$. The value of a curve at a given time point is a $p$-variate vector of measurements. We call the functional dataset univariate or multivariate depending on $p$. For instance, the multi-lead ECG data set analyzed by Pigoli and Sangalli (2012) is multivariate.
When faced with classification of functional data, one approach is to consider it as multivariate data in which the measurement(s) at different time points are separate variables. This yields high-dimensional data with typically many highly correlated variables, which can be dealt with by penalization (Hastie et al., 1995). Another approach is to project such data onto a lower-dimensional subspace and to continue with the projected data, e.g. by means of support vector machines (Rossi and Villa, 2006; Martin-Barragan et al., 2014). Li and Yu (2008) proposed to use statistics to select small subintervals in the domain and to restrict the analysis to those. Other techniques include the weighted distance method of Alonso et al. (2012) and the componentwise approach of Delaigle et al. (2012).
To reflect the dynamic behavior of functional data one can add their derivatives or integrals to the analysis, and/or add some preprocessing functions (warping functions, baseline corrections, …) as illustrated in Claeskens et al. (2014). This augments the data dimension and may add valuable information that can be beneficial in obtaining a better classification. We will illustrate this on a real data set in Section 7.3.
The study of robust methods for functional data started only recently. So far, efforts to construct robust classification rules for functional data have mainly used the concept of depth: López-Pintado and Romo (2006) used the modified band depth, Cuesta-Albertos and Nieto-Reyes (2010) made use of random Tukey depth, and Hlubinka et al. (2015) compared several depth functions in this context.
6.1 Functional depths and distances
Claeskens et al. (2014) proposed a type of multivariate functional depth (MFD) as follows. Consider a $p$-variate stochastic process $Y = \{Y(t), t \in U\}$, a statistical depth function $D(\cdot, \cdot)$ on $\mathbb{R}^p$, and a weight function $w$ on $U$ integrating to 1. Then the MFD of a curve $X$ on $U$ with respect to the distribution $P_Y$ is defined as
$$\mathrm{MFD}(X; P_Y) = \int_U D(X(t); P_{Y(t)})\, w(t)\, dt \tag{10}$$
where $P_{Y(t)}$ is the distribution of $Y$ at time $t$. The weight function $w(t)$ makes it possible to emphasize or downweight certain time regions, but in this paper it will be assumed constant. The functional median is defined as the curve with maximal MFD. Properties of the MFD may be found in Claeskens et al. (2014), with emphasis on the case where $D(\cdot, \cdot)$ is the halfspace depth. Several consistency results are derived in Nagy et al. (2016).
For ease of notation and to draw quick parallels to the multivariate non-functional case, we will denote the MFD based on halfspace depth by fHD, and the MFD based on projection depth and skew-adjusted projection depth by fPD and fSPD.
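Analogously, we can define the functional bagdistance (fbd) of a curve $X$ to (the distribution of) a stochastic process $Y$ as
$$\mathrm{fbd}(X; P_Y) = \int_U bd(X(t); P_{Y(t)})\, dt. \tag{11}$$
Similar extensions of the Stahel-Donoho outlyingness SDO and the adjusted outlyingness AO to the functional context are given by
$$\mathrm{fSDO}(X; P_Y) = \int_U \mathrm{SDO}(X(t); P_{Y(t)})\, dt \tag{12}$$
$$\mathrm{fAO}(X; P_Y) = \int_U \mathrm{AO}(X(t); P_{Y(t)})\, dt. \tag{13}$$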
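For curves observed on a common grid of time points, the integrals above become sums. The Python sketch below illustrates this for the functional SDO of univariate curves, under assumptions that are not from the paper: the time points are equally spaced, the domain is rescaled to unit length so that the integral becomes an average, and the MAD is used without a consistency factor. The function names are hypothetical.

```python
from statistics import median

def sdo_1d(x, values):
    """Univariate Stahel-Donoho outlyingness |x - med| / MAD (raw MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return abs(x - med) / mad

def functional_sdo(curve, sample_curves):
    """Average of the pointwise SDO over a common, equally spaced time grid."""
    n_times = len(curve)
    total = 0.0
    for t in range(n_times):
        cross_section = [c[t] for c in sample_curves]   # sample at time t
        total += sdo_1d(curve[t], cross_section)
    return total / n_times

# Five constant training curves observed at four time points.
sample_curves = [[level] * 4 for level in (-2.0, -1.0, 0.0, 1.0, 2.0)]
new_curve = [0.0, 0.0, 0.0, 4.0]
fsdo_value = functional_sdo(new_curve, sample_curves)
```

The new curve coincides with the pointwise median at three time points and has outlyingness 4 at the last one, so its functional SDO is 1.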
6.2 Functional classifiers
The classifiers discussed in Section 4 are readily adapted to functional data. By simply plugging in the functional versions of the distances and depths all procedures can be carried over. For the $k$-nearest neighbor method one typically uses the $L^2$-distance:
$$d_2(X_1, X_2) = \left( \int_U \| X_1(t) - X_2(t) \|^2 \, dt \right)^{1/2}.$$
The functional kNN method will be denoted as fkNN. It is simple but not affine invariant. Analogously we use the MaxDepth and DepthDepth rules based on fHD, fPD, and fSPD, as well as the MinDist and DistSpace rules based on fbd, fSDO, and fAO. Note that Mosler and Mozharovskyi (2016) already studied DepthDepth on functional data after applying a dimension reduction technique.
7 Functional data examples
7.1 Fighter plane dataset
The fighter plane dataset of Thakoor and Gao (2005) describes 7 shapes, namely those of the Mirage, Eurofighter, F14 with wings closed, F14 with wings opened, Harrier, F22 and F15. Each class contains 30 shape samples obtained from digital pictures, which Thakoor and Gao (2005) then reduced to the univariate functions in Figure 9. We obtained the data from the UCR Time Series Classification Archive (Chen et al., 2015).
In all, the plane data set consists of 210 observations divided among 7 groups. For the training data we randomly drew 15 observations from each group, and the test data were the remaining 105 observations. Repeating this 200 times yielded the misclassification percentages in Figure 10.
In this data set the DistSpace method performed best, followed by kNN which however suffered under 10% of mislabeling. Figure 10 contains no panel for DepthDepth + poly because the computation time of this method was infeasible due to the computation of the separating polynomials combined with majority voting for $G = 7$.
7.2 MRI dataset
Felipe et al. (2005) obtained intensities of MRI images of 9 different parts of the human body (plus a group consisting of all remaining body regions, which was of course very heterogeneous). They then transformed their data to curves. This data set was also downloaded from Chen et al. (2015). The $G = 9$ classes together contain 547 observations and are of unequal size. The curves for 4 of these classes are shown in Figure 11 (if we plot all 9 groups together, some become invisible).
For the training data we drew unequally sized random subsets from these groups. The misclassification rates of 200 experiments of this type are shown in Figure 12.
Here DistSpace performs a bit better than fkNN under contamination, and much better than MaxDepth and MinDist. Also in this example the DepthDepth + poly method took too long to compute.
7.3 Writing dataset
The writing dataset consists of 2858 character samples corresponding to the speed profile of the tip of a pen writing different letters, as captured on a WACOM tablet. The data came from the UCI Machine Learning Repository (Bache and Lichman, 2013). We added the $x$- and $y$-coordinates of the pen tip (obtained by integration) to the data, yielding $p = 4$ overall, unlike both previous examples which had $p = 1$. We further processed the data by removing the first and last time points and by interpolating to give all curves the same time domain. Samples corresponding to the letters ‘a’, ‘c’, ‘e’, ‘h’ and ‘m’ were retained. This yields a five-group supervised classification problem of four-dimensional functional data. Figure 13 plots the curves, with the 5 groups shown in different colors.
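The augmentation step can be sketched as follows. The text only states that the coordinates were obtained by integration; the trapezoidal rule, the constant time step, the zero starting position and the function names used here are assumptions of this illustration.

```python
def integrate_velocity(velocity, dt, start=0.0):
    """Cumulative trapezoidal integral of a sampled velocity profile."""
    position = [start]
    for v_prev, v_next in zip(velocity, velocity[1:]):
        position.append(position[-1] + 0.5 * (v_prev + v_next) * dt)
    return position

def augment(vx, vy, dt):
    """Stack velocities and integrated positions into 4-variate observations."""
    px = integrate_velocity(vx, dt)
    py = integrate_velocity(vy, dt)
    return list(zip(vx, vy, px, py))

# Constant horizontal speed and linearly increasing vertical speed.
curve = augment([1.0, 1.0, 1.0], [0.0, 2.0, 4.0], dt=0.5)
```

Each time point of the bivariate velocity curve thus becomes a 4-variate vector, so the depth and distance functions operate in dimension four at every time point.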
For each letter the training set was a random subset of 80 multivariate curves. The outcome is in Figure 14. There is no panel for the DepthDepth + poly classifier with separating polynomials and majority voting as its computation time was infeasible. MaxDepth and DepthDepth combined with kNN perform well except for fHD, again due to the fact that HD is zero outside the convex hull. DistSpace outperforms MinDist, and works well with all three distances. The best result was obtained by DistSpace with fbd.
Finally we applied fkNN and DistSpace to the original two-dimensional velocity data only. This resulted in larger median misclassification errors for all methods and all 3 data settings (0%, 5% and 10% mislabeling). For example, DistSpace with fbd on the two-dimensional data yielded a median misclassification error of 0.35%, whereas the median error was zero on the 4-dimensional augmented data. This shows that adding appropriate data-based functional information can be very useful to better separate groups.
8 Conclusions
Existing classification rules for multivariate or functional data, like kNN, often work well but can fail when the dispersion of the data depends strongly on the direction in which it is measured. The MaxDepth rule of Liu (1990) and its DepthDepth extension (Li et al., 2012) resolve this by their affine invariance, but perform poorly in combination with depth functions that become zero outside the convex hull of the data, like halfspace depth (HD).
This is why we prefer to use the bagdistance bd, which is based on HD and has properties very close to those of a norm but is able to reflect skewness (while still assuming some convexity). Rather than transforming the data to their depths we propose the distance transform, based on bd or a measure of outlyingness such as SDO or AO.
After applying the depth or distance transforms there are many possible ways to classify the transformed data. We found that the original separating polynomial method did not perform the best. Therefore we prefer to apply kNN to the transformed data.
In our experiments with real and simulated data we found that the best performing methods overall were DepthDepth + kNN (except with halfspace depth) and DistSpace + kNN. The latter approach combines affine invariance with the computation of a distance and the simplicity, lack of assumptions, and robustness of kNN, and works well for both multivariate and functional data.
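As an illustration of the DistSpace + kNN pipeline, the following is a minimal Python sketch, not the implementation used in the experiments. It assumes NumPy, approximates the SDO by maximizing over a finite set of random directions, and uses a plain Euclidean kNN with majority vote. The function names `sdo`, `distspace` and `knn_predict`, as well as the simulated two-class data, are hypothetical.

```python
import numpy as np

def sdo(points, sample, directions):
    """Approximate Stahel-Donoho outlyingness of the rows of `points`
    relative to `sample`. Assumption: the supremum over all directions
    is replaced by a maximum over the given unit vectors."""
    proj_sample = sample @ directions.T            # shape (n, m)
    proj_points = points @ directions.T            # shape (q, m)
    med = np.median(proj_sample, axis=0)
    mad = 1.4826 * np.median(np.abs(proj_sample - med), axis=0)
    return np.max(np.abs(proj_points - med) / mad, axis=1)

def distspace(points, X_train, y_train, classes, directions):
    """Map each point to the vector of its distances to all classes."""
    return np.column_stack(
        [sdo(points, X_train[y_train == g], directions) for g in classes]
    )

def knn_predict(Z_train, y_train, Z_new, k=5):
    """Euclidean kNN with majority vote (labels must be integers >= 0)."""
    d2 = ((Z_new[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in y_train[nearest]])

# Hypothetical toy data: two classes whose spread depends on the direction.
rng = np.random.default_rng(0)
scale = np.array([1.0, 5.0])
shift = np.array([6.0, 0.0])
X_train = np.vstack([rng.normal(size=(100, 2)) * scale,
                     rng.normal(size=(100, 2)) * scale + shift])
y_train = np.repeat([0, 1], 100)
X_test = np.vstack([rng.normal(size=(100, 2)) * scale,
                    rng.normal(size=(100, 2)) * scale + shift])
y_test = np.repeat([0, 1], 100)

# Random unit directions used to approximate the supremum in the SDO.
directions = rng.normal(size=(250, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

classes = np.unique(y_train)
Z_train = distspace(X_train, X_train, y_train, classes, directions)
Z_test = distspace(X_test, X_train, y_train, classes, directions)
y_pred = knn_predict(Z_train, y_train, Z_test, k=5)
accuracy = np.mean(y_pred == y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Replacing `sdo` by another distance, such as the bagdistance or the AO, would change only the distance function; the transform and the kNN step stay the same.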
In the multivariate classification setting the depth and distance transforms perform about equally well, and in particular MinDist on SDO and AO is equivalent to MaxDepth on the corresponding depths PD and SPD. But the bagdistance bd beats the halfspace depth HD in this respect because the latter is zero outside the convex hull of a group.
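The equivalence can be seen in one line, because projection depth is a strictly decreasing function of the outlyingness. Assuming the usual definition $\mathrm{PD} = 1/(1+\mathrm{SDO})$, and writing $G_1,\dots,G_m$ as hypothetical labels for the classes:

```latex
\[
\operatorname*{arg\,max}_{g}\; \mathrm{PD}(x; G_g)
  \;=\; \operatorname*{arg\,max}_{g}\; \frac{1}{1+\mathrm{SDO}(x; G_g)}
  \;=\; \operatorname*{arg\,min}_{g}\; \mathrm{SDO}(x; G_g).
\]
```

The same argument applies to SPD and AO when SPD is defined as $1/(1+\mathrm{AO})$.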
One of the most interesting results of our simulations is that the depth and distance transforms are less similar in the functional setting. Indeed, throughout Section 7 the distance transform outperformed the depth transform. This is because distances are more additive than depths, which matters because of the integrals in the definitions of functional depth MFD (10) to functional AO (13). For the sake of simplicity, let us focus on the empirical versions where the integrals in (10) to (13) become sums over a finite number of observed time points. These sums are L1 norms (we could also use L2 norms by taking the square root of the sum of squares). In the context of classification, we are measuring how different a new curve is from a process or a finite sample from it. When the curve differs strongly from the process in a few time points, the integrated depth (10) will have a few terms equal to zero or close to zero, which will not lead to an extremely small sum, so the curve would appear quite similar to the process. On the other hand, a functional distance measure like (11)–(13) will contain a few very large terms, which will have a large effect on the sum, thereby revealing that the curve is quite far from the process. In other words, functional distance adds up information about how distinct the curve is from the process. The main difference between the two approaches is that the depth terms are bounded from below (by zero), whereas distance terms are unbounded from above and thus better able to reflect discrepancies.
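The additivity argument can be checked on a toy example. The sketch below is an illustration under simplifying assumptions, not the functional depths (10) to (13) themselves: the data are univariate at each time point, the reference process is taken to have known center 0 and scale 1, the depth is the projection-type depth 1/(1 + outlyingness), and all time points get equal weight. The curve names `base` and `spiky` are hypothetical.

```python
import numpy as np

T = 100                                   # number of observed time points
base = np.full(T, 0.5)                    # curve close to the center everywhere
spiky = base.copy()
spiky[[10, 50, 90]] = 50.0                # same curve, far off at 3 time points

def outlyingness(curve, center=0.0, scale=1.0):
    """Pointwise distance to the center (center and scale assumed known)."""
    return np.abs(curve - center) / scale

def depth(curve):
    """Pointwise depth, bounded between 0 and 1."""
    return 1.0 / (1.0 + outlyingness(curve))

# Empirical versions of the integrals: averages over the time points.
int_depth_base = depth(base).mean()
int_depth_spiky = depth(spiky).mean()
int_dist_base = outlyingness(base).mean()
int_dist_spiky = outlyingness(spiky).mean()

print(f"integrated depth:    {int_depth_base:.4f} -> {int_depth_spiky:.4f}")
print(f"integrated distance: {int_dist_base:.4f} -> {int_dist_spiky:.4f}")
```

In this example the integrated depth drops by only about 3%, whereas the integrated distance roughly quadruples, so only the distance clearly flags the three discrepant time points.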
References
 Alonso et al. (2012) Alonso, A., Casado, D., and Romo, J. (2012). Supervised classification for functional data: a weighted distance approach. Computational Statistics & Data Analysis, 56:2334–2346.
 Bache and Lichman (2013) Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html
 Brys et al. (2005) Brys, G., Hubert, M., and Rousseeuw, P. J. (2005). A robustification of independent component analysis. Journal of Chemometrics, 19:364–375.
 Brys et al. (2004) Brys, G., Hubert, M., and Struyf, A. (2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13:996–1017.
 Chen et al. (2015) Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., and Batista, G.J. (2015). The UCR Time Series Classification Archive. www.cs.ucr.edu/~eamonn/time_series_data/
 Christmann et al. (2002) Christmann, A., Fischer, P., and Joachims, T. (2002). Comparison between various regression depth methods and the support vector machine to approximate the minimum number of misclassifications. Computational Statistics, 17:273–287.
 Christmann and Rousseeuw (2001) Christmann, A. and Rousseeuw, P. J. (2001). Measuring overlap in logistic regression. Computational Statistics & Data Analysis, 37:65–75.
 Claeskens et al. (2014) Claeskens, G., Hubert, M., Slaets, L., and Vakili, K. (2014). Multivariate functional halfspace depth. Journal of the American Statistical Association, 109(505):411–423.
 Cuesta-Albertos and Nieto-Reyes (2010) Cuesta-Albertos, J.A. and Nieto-Reyes, A. (2010). Functional classification and the random Tukey depth: Practical issues. In: Borgelt, C., Rodríguez, G.G., Trutschnig, W., Lubiano, M.A., Angeles Gil, M., Grzegorzewski, P. and Hryniewicz, O. (eds.) Combining Soft Computing and Statistical Methods in Data Analysis. Springer, Berlin Heidelberg, pages 123–130.
 Cuesta-Albertos et al. (2015) Cuesta-Albertos, J.A., Febrero-Bande, M., and Oviedo de la Fuente, M. (2015). The DD^G-classifier in the functional setting. arXiv:1501.00372v2.
 Delaigle et al. (2012) Delaigle, A., Hall, P., and Bathia, N. (2012). Componentwise classification and clustering of functional data. Biometrika, 99:299–313.
 Donoho (1982) Donoho, D. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying paper, Dept. Statistics, Harvard University, Boston.
 Donoho and Gasko (1992) Donoho, D. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827.
 Dutta and Ghosh (2011) Dutta, S. and Ghosh, A. (2011). On robust classification using projection depth. Annals of the Institute of Statistical Mathematics, 64:657–676.
 Dyckerhoff and Mozharovskyi (2016) Dyckerhoff, R. and Mozharovskyi, P. (2016). Exact computation of the halfspace depth. Computational Statistics & Data Analysis, 98:19–30.
 Ferraty and Vieu (2006) Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York.
 Felipe et al. (2005) Felipe, J.C., Traina, A.J.M., and Traina, C. (2005). Global warp metric distance: boosting content-based image retrieval through histograms. Proceedings of the Seventh IEEE International Symposium on Multimedia (ISM’05), p.8.
 Fix and Hodges (1951) Fix, E. and Hodges, J. L. (1951). Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
 Ghosh and Chaudhuri (2005) Ghosh, A. and Chaudhuri, P. (2005). On maximum depth and related classifiers. Scandinavian Journal of Statistics, 32(2):327–350.
 Hallin et al. (2010) Hallin, M., Paindaveine, D., and Šiman, M. (2010). Multivariate quantiles and multiple-output regression quantiles: from optimization to halfspace depth. The Annals of Statistics, 38(2):635–669.
 Hastie et al. (1995) Hastie, T., Buja, A., and Tibshirani, R. (1995). Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102.
 Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York, second edition.
 Hlubinka et al. (2015) Hlubinka, D., Gijbels, I., Omelka, M., and Nagy, S. (2015). Integrated data depth for smooth functions and its application in supervised classification. Computational Statistics, 30:1011–1031.
 Hubert et al. (2015) Hubert, M., Rousseeuw, P. J., and Segaert, P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24:177–202.
 Hubert and Van der Veeken (2010) Hubert, M. and Van der Veeken, S. (2010). Robust classification for skewed data. Advances in Data Analysis and Classification, 4:239–254.
 Hubert and Vandervieren (2008) Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12):5186–5201.
 Hubert and Van Driessen (2004) Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 45:301–320.
 Koenker and Bassett (1978) Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46:33–50.
 Lange et al. (2014) Lange, T., Mosler, K., and Mozharovskyi, P. (2014). Fast nonparametric classification based on data depth. Statistical Papers, 55(1):49–69.
 Li and Yu (2008) Li, B. and Yu, Q. (2008). Classification of functional data: A segmentation approach. Computational Statistics & Data Analysis, 52(10):4790–4800.
 Li et al. (2012) Li, J., Cuesta-Albertos, J., and Liu, R. (2012). DD-classifier: nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107:737–753.
 Liu (1990) Liu, R. (1990). On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414.
 López-Pintado and Romo (2006) López-Pintado, S. and Romo, J. (2006). Depth-based classification for functional data. In Data depth: Robust Multivariate Analysis, Computational Geometry and Applications, volume 72 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 103–119. Amer. Math. Soc., Providence, RI.
 Maronna et al. (2006) Maronna, R., Martin, D., and Yohai, V. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
 Martin-Barragan et al. (2014) Martin-Barragan, B., Lillo, R., and Romo, J. (2014). Interpretable support vector machines for functional data. European Journal of Operational Research, 232(1):146–155.
 Massé and Theodorescu (1994) Massé, J.-C. and Theodorescu, R. (1994). Halfplane trimming for bivariate distributions. Journal of Multivariate Analysis, 48(2):188–202.
 Mosler (2013) Mosler, K. (2013). Depth statistics. In Becker, C., Fried, R., and Kuhnt, S., editors, Robustness and Complex Data Structures, Festschrift in Honour of Ursula Gather, pages 17–34, Berlin, Springer.
 Mosler and Mozharovskyi (2016) Mosler, K. and Mozharovskyi, P. (2016). Fast DD-classification of functional data. Statistical Papers, doi: 10.1007/s00362-015-0738-3.
 Nagy et al. (2016) Nagy, S., Gijbels, I., Omelka, M., and Hlubinka, D. (2016). Integrated depth for functional data: statistical properties and consistency. ESAIM Probability and Statistics, doi: 10.1051/ps/2016005.
 Paindaveine and Šiman (2012) Paindaveine, D. and Šiman, M. (2012). Computing multiple-output regression quantile regions. Computational Statistics & Data Analysis, 56:840–853.
 Pigoli and Sangalli (2012) Pigoli, D. and Sangalli, L. (2012). Wavelets in functional data analysis: estimation of multidimensional curves and their derivatives. Computational Statistics & Data Analysis, 56(6):1482–1498.
 Ramsay and Silverman (2005) Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer, New York, 2nd edition.
 Rossi and Villa (2006) Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing, 69:730–742.
 Rousseeuw and Hubert (1999) Rousseeuw, P. J. and Hubert, M. (1999). Regression depth. Journal of the American Statistical Association, 94:388–402.
 Rousseeuw and Leroy (1987) Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York.
 Rousseeuw and Ruts (1996) Rousseeuw, P. J. and Ruts, I. (1996). Bivariate location depth. Applied Statistics, 45:516–526.
 Rousseeuw and Ruts (1998) Rousseeuw, P. J. and Ruts, I. (1998). Constructing the bivariate Tukey median. Statistica Sinica, 8:827–839.
 Rousseeuw and Ruts (1999) Rousseeuw, P. J. and Ruts, I. (1999). The depth function of a population distribution. Metrika, 49:213–244.
 Rousseeuw et al. (1999) Rousseeuw, P. J., Ruts, I., and Tukey, J. (1999). The bagplot: a bivariate boxplot. The American Statistician, 53:382–387.
 Rousseeuw and Struyf (1998) Rousseeuw, P. J. and Struyf, A. (1998). Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8:193–203.
 Ruts and Rousseeuw (1996) Ruts, I. and Rousseeuw, P. J. (1996). Computing depth contours of bivariate point clouds. Computational Statistics & Data Analysis, 23:153–168.
 Stahel (1981) Stahel, W. (1981). Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. PhD thesis, ETH Zürich.
 Struyf and Rousseeuw (2000) Struyf, A. and Rousseeuw, P. J. (2000). High-dimensional computation of the deepest location. Computational Statistics & Data Analysis, 34(4):415–426.
 Thakoor and Gao (2005) Thakoor, N. and Gao, J. (2005). Shape classifier based on generalized probabilistic descent method with hidden Markov descriptor. Tenth IEEE International Conference on Computer Vision (ICCV 2005), Vol. 1: 495–502.
 Tukey (1975) Tukey, J. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Volume 2, pages 523–531, Vancouver.
 Zuo (2003) Zuo, Y. (2003). Projection-based depth functions and associated medians. The Annals of Statistics, 31(5):1460–1490.
 Zuo and Serfling (2000) Zuo, Y. and Serfling, R. (2000). General notions of statistical depth function. The Annals of Statistics, 28:461–482.