Convergence Rates for Empirical Estimation of Binary Classification Bounds
Bounding the best achievable error probability for binary classification problems is relevant to many applications including machine learning, signal processing, and information theory. Many bounds on the Bayes binary classification error rate depend on information divergences between the pair of class distributions. Recently, the Henze-Penrose (HP) divergence has been proposed for bounding classification error probability. We consider the problem of empirically estimating the HP-divergence from random samples. We derive a bound on the convergence rate for the Friedman-Rafsky (FR) estimator of the HP-divergence, which is related to a multivariate runs statistic for testing between two distributions. The FR estimator is derived from a multicolored Euclidean minimal spanning tree (MST) that spans the merged samples. We obtain a concentration inequality for the Friedman-Rafsky estimator of the Henze-Penrose divergence. We validate our results experimentally and illustrate their application to real datasets.
Divergence measures between probability density functions are used in many signal processing applications including classification, segmentation, source separation, and clustering (see [1, 2, 3]). For more applications of divergence measures, we refer to .
In classification problems, the Bayes error rate is the expected risk for the Bayes classifier, which assigns a given feature vector to the class with the highest posterior probability. The Bayes error rate is the lowest possible error rate of any classifier for a particular joint distribution. Mathematically, let be realizations of random vector and class labels , with prior probabilities and , such that . Given conditional probability densities and , the Bayes error rate is given by
The Bayes error rate provides a measure of classification difficulty. Thus when known, the Bayes error rate can be used to guide the user in the choice of classifier and tuning parameter selection. In practice, the Bayes error is rarely known and must be estimated from data. Estimation of the Bayes error rate is difficult due to the nonsmooth function within the integral in (1). Thus, research has focused on deriving tight bounds on the Bayes error rate based on smooth relaxations of the function. Many of these bounds can be expressed in terms of divergence measures such as the Bhattacharyya  and Jensen-Shannon . Tighter bounds on the Bayes error rate can be obtained using an important divergence measure known as the Henze-Penrose (HP) divergence [7, 8].
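As a numerical illustration of the quantity in (1): the Bayes error rate is the integral of the pointwise minimum of the prior-weighted class densities. The following minimal sketch approximates that integral for two univariate Gaussian classes and compares it with the closed form available in this special case. The helper names (`gaussian_pdf`, `bayes_error_1d`), the choice of densities, and the integration grid are illustrative assumptions, not part of the paper.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_error_1d(pdf0, pdf1, p, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of min(p*f0, q*f1) on [lo, hi]."""
    q = 1.0 - p
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        # The min(...) is the nonsmooth integrand that makes direct estimation hard.
        total += min(p * pdf0(x), q * pdf1(x))
    return total * h

# Two unit-variance Gaussians with means 0 and 2 and equal priors (assumed example).
eps = bayes_error_1d(lambda x: gaussian_pdf(x, 0.0),
                     lambda x: gaussian_pdf(x, 2.0),
                     p=0.5, lo=-10.0, hi=12.0)

# Closed form for this special case: Phi(-1), about 0.1587.
closed_form = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
print(eps, closed_form)
```

In higher dimensions this direct integration is no longer practical, which is one motivation for bounding the Bayes error through divergence measures instead.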
Many techniques have been developed for estimating divergence measures. These methods can be broadly classified into two categories: (i) plug-in estimators, in which we estimate the probability densities and then plug them into the divergence function [9, 10, 11, 12], and (ii) entropic graph approaches, in which the relationship between the divergence function and a graph functional in Euclidean space is derived, , . Examples of plug-in methods include K-nearest neighbor (K-NN) and kernel density estimator (KDE) divergence estimators. Examples of entropic graph approaches include methods based on minimal spanning trees (MST), K-nearest neighbor graphs (K-NNG), minimal matching graphs (MMG), the traveling salesman problem (TSP), and their power-weighted variants.
Disadvantages of plug-in estimators are that these methods often require assumptions on the support set boundary and are more computationally complex than direct graph-based approaches. Thus for practical and computational reasons, the asymptotic behavior of entropic graph approaches has been of great interest. Asymptotic analysis has been used to justify graph-based approaches. For instance in , the authors showed that a cross-match statistic based on optimal weighted matching converges to the HP-divergence. In , a more complex approach based on the K-NNG was proposed that also converges to the HP-divergence.
The first contribution of our paper is that we obtain a bound on the convergence rates for the Friedman and Rafsky (FR) estimator of the HP-divergence, which is based on a multivariate extension of the non-parametric run length test of equality of distributions. This estimator is constructed using a multicolored MST on the labeled training set where MST edges connecting samples with dichotomous labels are colored differently from edges connecting identically labeled samples. While previous works have investigated the FR test statistic in the context of estimating the HP-divergence (see [8, 16]), to the best of our knowledge its minimax MSE convergence rate has not been previously derived. The bound on convergence rate is established by using the umbrella theorem of , for which we define a dual version of the multicolor MST. The proposed dual MST in this work is different than the standard dual MST introduced by Yukich in . We show that the bias rate of the FR estimator is bounded by a function of , and , as , where is the total sample size, is the dimension of the data samples , and is the Hölder smoothness parameter . We also obtain the variance rate bound as .
The second contribution of our paper is a new concentration bound for the FR test statistic. The bound is obtained by establishing a growth bound and a smoothness condition for the multicolored MST. Since the FR test statistic is not a Euclidean functional we cannot use the standard subadditivity and superadditivity approaches of [17, 18, 19]. Our concentration inequality is derived using a different Hamming distance approach and a dual graph to the multicolored MST.
We experimentally validate our theoretical results. We compare the MSE theory and simulation in three experiments with various dimensions . We observe that in all three experiments the MSE decreases as the sample size increases, and that the rate of decrease is slower for higher dimensions. In all sets of experiments our theory matches the experimental results. Furthermore, we illustrate the application of our results to estimation of the Bayes error rate on three real datasets.
I-A Related work
Much research on minimal graphs has focused on the use of Euclidean functionals for signal processing and statistics applications such as image registration , , pattern matching  and non-parametric divergence estimation . A K-NNG-based estimator of Rényi and -divergence measures has been proposed in . Additional examples of direct estimators of divergence measures include statistics based on the nonparametric two-sample problem, the Smirnov maximum deviation test  and the Wald-Wolfowitz  runs test, which have been studied in .
Many entropic graph estimators such as MST, K-NNG, MMG and TSP have been considered for multivariate data from a single probability density . In particular, the normalized weight function of graph constructions all converge almost surely to the Rényi entropy of , [28, 17]. For uniformly distributed points, the MSE is [29, 30]. Later Hero et al. ,  reported bounds on -norm bias convergence rates of power-weighted Euclidean weight functionals of order for densities belonging to the space of Hölder continuous functions as , where , , , and . We derive a bound on convergence rates when the density functions belong to the strong Hölder class, , for , . Note that throughout the paper we assume the density functions are absolutely continuous and bounded with support on the unit cube .
In , Yukich introduced the general framework of continuous and quasi-additive Euclidean functionals. This has led to many convergence rate bounds of entropic graph divergence estimators.
The framework of  is as follows: Let be finite subset of points in , , drawn from an underlying density. A real-valued function defined on is called a Euclidean functional of order if it is of the form , where is a set of graphs, is an edge in the graph , is the Euclidean length of , and is called the edge exponent or power-weighting constant. The MST, TSP, and MMG are some examples for which .
Following this framework, we show that the FR test statistic satisfies the required continuity and quasi-additivity properties to obtain similar convergence rates to those predicted in . What distinguishes our work from previous work is that the count of dichotomous edges in the multicolored MST is not Euclidean. Therefore, the results in [28, 17],, and  are not directly applicable.
Using the isoperimetric approach, Talagrand  showed that when the Euclidean functional is based on the MST or TSP, then the functional for derived random vertices uniformly distributed in a hypercube is concentrated around its mean. Namely, with high probability the functional and its mean do not differ by more than . In this paper, we establish concentration bounds for the FR statistic: with high probability the FR statistic differs from its mean by not more than , where is a function of and .
This paper is organized as follows. In Section II, we first introduce the HP-divergence and the FR multivariate test statistic. We then present the bias and variance rates of the FR-based estimator of HP-divergence followed by the concentration bounds and the minimax MSE convergence rate. Section III provides simulations that validate the theory. All proofs and relevant lemmas are given in the Appendices and Supplementary Materials.
Throughout the paper, we denote expectation by and variance by abbreviation . Bold face type indicates random variables.
II The Henze-Penrose divergence measure
Consider parameters and . We focus on estimating the HP-divergence measure between distributions and with domain defined by
It can be verified that this measure is bounded between 0 and 1 and if , then . In contrast with some other divergences such as the Kullback-Leibler  and Rényi divergences , the HP-divergence is symmetrical, i.e., . By invoking (3) in , one can rewrite in the alternative form:
Throughout the paper, we refer to as the HP-integral. The HP-divergence measure belongs to the class of -divergences . For the special case , the divergence (2) becomes the symmetric -divergence and is similar to the Rukhin -divergence. See , .
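For reference, the HP-divergence and the HP-integral discussed in this section are commonly written in the following form in the literature on the Henze-Penrose divergence. The symbols used here (class densities f_0 and f_1, priors p and q with p + q = 1, divergence D_p, and HP-integral A_p) are assumed notation, since the displayed equations are not reproduced in this text.

```latex
% Assumed notation: densities f_0, f_1; priors p, q with p + q = 1.
\begin{align}
D_p(f_0, f_1) &= \frac{1}{4pq}\left[\int \frac{\bigl(p f_0(x) - q f_1(x)\bigr)^2}{p f_0(x) + q f_1(x)}\,dx - (p-q)^2\right], \\
A_p(f_0, f_1) &= \int \frac{f_0(x)\, f_1(x)}{p f_0(x) + q f_1(x)}\,dx,
\qquad
D_p(f_0, f_1) = 1 - A_p(f_0, f_1).
\end{align}
```

The last identity follows from expanding (p f_0 - q f_1)^2 = (p f_0 + q f_1)^2 - 4pq f_0 f_1 and using 1 - (p - q)^2 = 4pq when p + q = 1.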
II-A The Multivariate Runs Test Statistic
The MST is a graph of minimum weight among all graphs that span vertices. The MST has many applications including pattern recognition , clustering , nonparametric regression , and testing of randomness . In this section we focus on the FR multivariate two sample test statistic constructed from the MST.
Assume that sample realizations from and , denoted by and , respectively, are available. Construct an MST spanning the samples from both and , color the edges in the MST that connect dichotomous samples green, and color the remaining edges black. The FR test statistic is the number of green edges in the MST. Note that the test assumes a unique MST; therefore all interpoint distances between data points must be distinct. We recall the following theorem from  and :
As and such that and ,
In the next section we obtain bounds on the MSE convergence rates of the FR approximation for HP-divergence between densities that belong to , the class of strong Hölder continuous functions with Lipschitz constant and smoothness parameter , :
(Strong Hölder class) Let be a compact space. The strong Hölder class , with -Hölder parameter, of functions with the -norm, consists of the functions that satisfy
where is the Taylor polynomial (multinomial) of of order expanded about the point and is defined as the greatest integer strictly less than . Note that for the standard Hölder class the term in the RHS of (4) is omitted.
In what follows, we will use both notations and for the FR statistic over the combined samples.
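The construction of the FR statistic described in this subsection can be sketched as follows. The sketch builds the Euclidean MST on the merged samples with a simple quadratic-time Prim's algorithm, counts the edges joining differently labeled points, and plugs the count R into the estimate 1 - R(m + n)/(2mn), which is the normalization commonly used for the FR estimator of the HP-divergence. The function names and the choice of Prim's algorithm are illustrative assumptions; the paper does not prescribe an implementation.

```python
import math
import random

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each vertex to the current tree
    best_from = [-1] * n         # tree vertex realizing that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        k = min((j for j in range(n) if not in_tree[j]), key=lambda j: best_dist[j])
        in_tree[k] = True
        edges.append((best_from[k], k))
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return edges

def fr_statistic(x, y):
    """Number of MST edges on the merged sample that join points with different labels."""
    points = list(x) + list(y)
    labels = [0] * len(x) + [1] * len(y)
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])

def hp_divergence_estimate(x, y):
    """Raw (unclipped) FR-based estimate of the HP-divergence."""
    m, n = len(x), len(y)
    return 1.0 - fr_statistic(x, y) * (m + n) / (2.0 * m * n)

# Two well-separated clusters: exactly one MST edge joins them.
rng = random.Random(0)
x = [(rng.random(), rng.random()) for _ in range(20)]
y = [(10.0 + rng.random(), 10.0 + rng.random()) for _ in range(20)]
r_sep = fr_statistic(x, y)            # 1
d_sep = hp_divergence_estimate(x, y)  # 1 - 40/800 = 0.95
print(r_sep, d_sep)
```

For samples from nearly identical distributions the raw estimate can be slightly negative; clipping it at zero is a common practical choice but is not assumed here.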
II-B Convergence Rates
In this subsection we obtain the mean convergence rate bounds for general non-uniform Lebesgue densities and belonging to the strong Hölder class . Since the expectation of can be closely approximated by the sum of the expectation of the FR statistic constructed on a dense partition of , then is a quasi-additive functional in mean. The family of bounds (30) in Appendix B enables us to achieve the minimax convergence rate for the mean under the strong Hölder class assumption with smoothness parameter , :
(Convergence Rate of the Mean) Let , and be the FR statistic for samples drawn from strong Hölder continuous and bounded density functions and in . Then for ,
This bound holds over the class of Lebesgue densities , . Note that this assumption can be relaxed to and , that is, the Lebesgue densities and belong to the strong Hölder class with the same Hölder parameter and different constants and , respectively.
The following variance bound uses the Efron-Stein inequality . Note that in Theorem 3 we do not impose any strict assumptions. We only assume that the density functions are absolutely continuous and bounded with support on the unit cube . Appendix C contains the proof.
The variance of the HP-integral estimator based on the FR statistic, is bounded by
where the constant depends only on .
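For reference, the Efron-Stein inequality invoked for the variance bound has the following standard form. The notation (a function g of N independent samples, and Z^{(i)} obtained by replacing the i-th sample with an independent copy) is assumed here for illustration and may differ from the notation used in Appendix C.

```latex
% Assumed notation: X_1, ..., X_N independent, X_i' an independent copy of X_i.
\begin{equation}
Z = g(X_1, \dots, X_N), \qquad
Z^{(i)} = g(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_N),
\end{equation}
\begin{equation}
\operatorname{Var}(Z) \;\le\; \frac{1}{2} \sum_{i=1}^{N} \mathbb{E}\!\left[\bigl(Z - Z^{(i)}\bigr)^2\right].
\end{equation}
```

Applied to the FR statistic, the inequality reduces the variance bound to controlling how much the count of dichotomous MST edges can change when a single sample point is replaced.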
II-C Proof Sketch of Theorem 2
In this subsection, we first establish subadditivity and superadditivity properties of the FR statistic which will be employed to derive the MSE convergence rate bound. This will establish that the mean of the FR test statistic is a quasi-additive functional:
Let be the number of edges that link nodes from differently labeled samples and in . Partition into equal volume subcubes such that and are the number of samples from and , respectively, that fall into the partition . Then there exists a constant such that
Here is the number of dichotomous edges in partition . Conversely, for the same conditions as above on partitions , there exists a constant such that
where indicates the number of all edges of the MST which intersect two different partitions.
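The quantities that enter the subadditivity and superadditivity statements above can be computed empirically. The sketch below, for points in the unit square, returns the FR statistic on the full sample, the sum of the FR statistics computed separately inside each subcube of a uniform partition, and the number of global MST edges whose endpoints fall in different subcubes. It makes no claim about the constants in the theorem; the function names, the quadratic-time Prim's algorithm, and the uniform test data are illustrative assumptions.

```python
import math
import random

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        k = min((j for j in range(n) if not in_tree[j]), key=lambda j: best_dist[j])
        in_tree[k] = True
        edges.append((best_from[k], k))
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return edges

def dichotomous_count(points, labels):
    """Number of MST edges joining points with different labels."""
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])

def cell_of(point, l):
    """Index of the subcube of side 1/l that contains a point of the unit cube."""
    return tuple(min(int(c * l), l - 1) for c in point)

def partition_quantities(x, y, l):
    """Return (global FR, sum of per-subcube FR, number of MST edges crossing subcubes)."""
    points = list(x) + list(y)
    labels = [0] * len(x) + [1] * len(y)
    edges = mst_edges(points)
    r_global = sum(1 for i, j in edges if labels[i] != labels[j])
    crossing = sum(1 for i, j in edges
                   if cell_of(points[i], l) != cell_of(points[j], l))
    cells = {}
    for pt, lab in zip(points, labels):
        pts, labs = cells.setdefault(cell_of(pt, l), ([], []))
        pts.append(pt)
        labs.append(lab)
    r_local_sum = sum(dichotomous_count(pts, labs) for pts, labs in cells.values())
    return r_global, r_local_sum, crossing

rng = random.Random(1)
x = [(rng.random(), rng.random()) for _ in range(100)]
y = [(rng.random(), rng.random()) for _ in range(100)]
r_global, r_local_sum, crossing = partition_quantities(x, y, l=3)
print(r_global, r_local_sum, crossing)
```

Comparing `r_global` with `r_local_sum` for increasing `l` gives a rough feel for how the gap between the two is controlled by the number of crossing edges.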
Furthermore, we adapt the theory developed in [17, 31] to derive the MSE convergence rate of the FR statistic-based estimator by defining a dual MST and dual FR statistic, denoted by and respectively (see Fig. 2):
(Dual MST, and dual FR statistic ) Let be the set of corner points of the subsection for . Then we define as the boundary MST graph of partition , which contains and points falling inside the section and those corner points in which minimize the total MST length. Note that the MSTs in and may be connected through points strictly contained in and , and that corner points are taken into account subject to minimizing the total MST length. In other words, the dual MST can connect each point in by a direct edge either to another point in or to a corner point (we assume that all corner points are connected) in order to minimize the total length. To clarify this, assume that there are two points in ; then the dual MST consists of the two edges connecting these points to the corners if they are close to corner points, and otherwise consists of an edge connecting one point to the other. Further, we define as the number of edges in graph connecting nodes from different samples together with the number of edges connecting to the corner points. Note that the edges connected to the corner nodes (regardless of the type of points) are always counted in the dual FR test statistic .
In Appendix B, we show that the dual FR test statistic is a quasi-additive functional in mean and . This property holds true since and graphs can only be different in the edges connected to the corner nodes, and in we take all of the edges between these nodes and corner nodes into account.
To prove Theorem 2, we partition into subcubes. Then by applying Theorem 4 and the dual MST we derive the bias rate in terms of partition parameter (see (30) in Theorem 8). See Appendix B and Supplementary Materials for details. According to (30), for , and , the slowest rates as a function of are and . Therefore we obtain an -independent bound by letting be a function of that minimizes the maximum of these rates i.e.
The full proof of the bound in (2) is given in Appendix B.
II-D Concentration Bounds
Another main contribution of this work is an exponential concentration bound for the FR estimator of the HP-divergence. The error of this estimator can be decomposed into a bias term and a variance-like term via the triangle inequality:
The bias bound was given in Theorem 2. Therefore we focus on an exponential concentration bound for the variance-like term. One application of concentration bounds is to compare confidence intervals on the HP-divergence measure in terms of the FR estimator. In  and  the authors provided an exponential concentration bound for an estimator of the Rényi divergence for a smooth Hölder class of densities on the -dimensional unit cube . We show that if and are sets of and points drawn from any two distributions and , respectively, the FR criterion is tightly concentrated. Namely, we establish that with high probability, is within
of its expected value, where is the solution of the following convex optimization problem:
See Appendix D for more detail. Indeed, we first show the concentration around the median. A median is by definition any real number that satisfies the inequalities and . To derive the concentration results, the properties of growth bounds and smoothness for , given in Appendix D, are exploited.
(Concentration around the median) Let be a median of which implies that . Recall from (9) then we have
(Concentration of around the mean) Let be the FR statistic. Then
Here and the explicit form for is given by (10) when .
See Appendix D for full proofs of Theorems 5 and 6. Here we sketch the proofs. The proof of the concentration inequality for , Theorem 6, requires involving the median , where , inside the probability term by using
To prove the expressions for the concentration around the median, Theorem 5, we first consider the uniform partitions of , with edges parallel to the coordinate axes having edge lengths and volumes . Then by applying the Markov inequality we show that with probability at least , where , the FR statistic is subadditive with threshold. Afterward, by induction , the growth bound can be derived with probability at least . The growth bound states that with high probability there exists a constant depending on and , , such that . Applying the law of total probability and the semi-isoperimetric inequality (123) in Lemma 11 gives us (49). By considering the solution to the convex optimization problem (9), i.e. and optimal , the claimed results (11) and (12) are derived. The only constraint here is that is lower bounded by a function of .
Next, we provide a bound for the variance-like term that holds with probability at least . According to the previous results we expect that this bound depends on , , and . The proof is short and is given in Appendix D.
(Variance-like bound for ) Let be the FR statistic. With probability at least we have
where depends on , and is given in (10) when .
III Numerical Experiments
III-A Simulation Study
In this section, we apply the FR statistic estimate of the HP-divergence to both simulated and real data sets. We present results of a simulation study that evaluates the proposed bound on the MSE. We numerically validate the theory stated in Subsections II-B and II-D using multiple simulations. In the first set of simulations, we consider two multivariate Normal random vectors , and perform three experiments , to analyze the FR test statistic-based estimator performance as the sample sizes , increase. For the three dimensions we generate samples from two normal distributions with identity covariance and shifted means: , and , and , when , and respectively. For all of the following experiments the sample sizes for each class are equal ().
We vary up to . From Fig. 3 we deduce that when the sample size increases the MSE decreases such that for higher dimensions the rate is slower. Furthermore we compare the experiments with the theory in Fig. 3. Our theory generally matches the experimental results. However, the MSE for the experiments tends to decrease to zero faster than the theoretical bound. Since the Gaussian distribution has a smooth density, this suggests that a tighter bound on the MSE may be possible by imposing stricter assumptions on the density smoothness as in .
In our next simulation we compare three bivariate cases. First, we generate samples from a standard Normal distribution. Second, we consider a distinct smooth class of distributions, i.e. the bivariate Gamma density with standard parameters and dependency coefficient . Third, we generate samples from standard Student's t-distributions. Our goal in this experiment is to compare the MSE of the HP-divergence estimator between two identical distributions, , when is one of the Gamma, Normal, and Student's t density functions. In Fig. 4, we observe that the MSE decreases as increases for all three distributions.
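The identical-distribution experiment can be mimicked with a small Monte Carlo sketch. When both samples come from the same distribution the true HP-divergence is zero, so the MSE is simply the average squared estimate. The code below does this for the bivariate standard Normal case only; the sample sizes, the number of trials, the seed, and all function names are illustrative assumptions and do not reproduce the settings behind Fig. 4.

```python
import math
import random

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        k = min((j for j in range(n) if not in_tree[j]), key=lambda j: best_dist[j])
        in_tree[k] = True
        edges.append((best_from[k], k))
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return edges

def hp_estimate(x, y):
    """Raw FR-based estimate 1 - R(m + n)/(2mn) of the HP-divergence."""
    points = list(x) + list(y)
    labels = [0] * len(x) + [1] * len(y)
    r = sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
    m, n = len(x), len(y)
    return 1.0 - r * (m + n) / (2.0 * m * n)

def mse_identical_normals(n, trials, seed=0):
    """Monte Carlo MSE of the estimate when both samples are bivariate standard Normal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        y = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        total += hp_estimate(x, y) ** 2  # the true divergence is 0 here
    return total / trials

# Equal class sizes, as in the experiments described above.
mse_by_n = {n: mse_identical_normals(n, trials=20) for n in (20, 40, 80)}
for n, mse in mse_by_n.items():
    print(n, mse)
```

With so few trials the printed values are noisy; more trials and larger sample sizes would be needed before comparing the observed decay with the theoretical rate.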
III-B Real Datasets
We now show the results of applying the FR test statistic to estimate the HP-divergence using three different real datasets, :
Human Activity Recognition (HAR), Wearable Computing, Classification of Body Postures and Movements (PUC-Rio): This dataset contains 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected on 8 hours of activities of 4 healthy subjects.
Skin Segmentation dataset (SKIN): The skin dataset is collected by randomly sampling B,G,R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders obtained from the FERET and PAL databases .
Sensorless Drive Diagnosis (ENGIN) dataset: In this dataset features are extracted from electric current drive signals. The drive has intact and defective components. The dataset contains 11 different classes with different conditions. Each condition has been measured several times under 12 different operating conditions, e.g. different speeds, load moments and load forces.
We focus on two classes from each of the HAR, SKIN, and ENGIN datasets.
In the first experiment, we computed the HP-divergence and the MSE for the FR test statistic estimator as the sample size increases. We observe in Fig. 5 that the estimated HP-divergence ranges in , which is consistent with the HP-divergence property . Interestingly, when increases the HP-divergence tends to 1 for all of the HAR, SKIN, and ENGIN datasets. Note that in this set of experiments we have repeated the experiments on independent parts of the datasets to obtain the error bars. Fig. 6 shows that, as expected, the MSE decreases as the sample size grows for all three datasets. Here we have used the KDE plug-in estimator , implemented on all available samples, to approximate the true HP-divergence. Furthermore, according to Fig. 6 the FR test statistic-based estimator suggests that the Bayes error rate is larger for the SKIN dataset compared to the HAR and ENGIN datasets.
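An estimated HP-divergence can be turned into a statement about the Bayes error rate through the bounds associated with the HP-divergence. The sketch below uses the form of these bounds commonly stated in the literature: with u = 4pq D + (p - q)^2, the Bayes error lies between 1/2 - sqrt(u)/2 and 1/2 - u/2. This form is taken as an assumption here because the text does not display the bounds, and the function name is hypothetical.

```python
import math

def bayes_error_bounds(hp_divergence, p=0.5):
    """Lower and upper bounds on the Bayes error from an HP-divergence value.

    Assumed form: u = 4*p*q*D + (p - q)**2, lower = 1/2 - sqrt(u)/2, upper = 1/2 - u/2.
    """
    q = 1.0 - p
    # Clip the estimate to [0, 1], the range of the HP-divergence.
    d = min(max(hp_divergence, 0.0), 1.0)
    u = 4.0 * p * q * d + (p - q) ** 2
    lower = 0.5 - 0.5 * math.sqrt(u)
    upper = 0.5 - 0.5 * u
    return lower, upper

separable = bayes_error_bounds(1.0)   # (0.0, 0.0): perfectly separable classes
identical = bayes_error_bounds(0.0)   # (0.5, 0.5): identical classes, equal priors
partial = bayes_error_bounds(0.75)    # an intermediate case
print(separable, identical, partial)
```

Under this form, a divergence estimate close to 1 corresponds to a Bayes error close to 0, which is the sense in which a smaller estimated divergence suggests a larger Bayes error.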
In our next experiment, we add the first 6 features (dimensions), in order, to our datasets and evaluate the FR test statistic's performance as the HP-divergence estimator. Surprisingly, the estimated HP-divergence does not change for the HAR sample, whereas large changes are observed for the SKIN and ENGIN samples (see Fig. 7).
Finally, we apply the concentration bounds on the FR test statistic (i.e. Theorems 6 and 7) and compute the theoretical implicit variance-like bound for the FR criterion with error for the real datasets ENGIN, HAR, and SKIN. Since the datasets ENGIN, HAR, and SKIN have equal total sample size and different dimensions , respectively, we first compare the concentration bound (13) on the FR statistic in terms of the dimension when . For the real datasets ENGIN, HAR, and SKIN we obtain
where , respectively, and is a constant not dependent on . One observes that as the dimension decreases the interval becomes significantly tighter. However, this may not hold in general, and computing the bound (13) precisely requires knowledge of the distributions and unknown constants. In Table 1 we compute the standard variance-like bound by applying the percentiles technique and observe that the bound threshold is not monotonic in terms of the dimension . Table 1 shows the FR test statistic, HP-divergence estimate (denoted by , , respectively), and standard variance-like interval for the FR statistic using the three real datasets HAR, SKIN, and ENGIN.
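Table 1: FR test statistic, HP-divergence estimate, and standard variance-like interval for the FR statistic on the HAR, SKIN, and ENGIN datasets.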
IV Conclusion
We derived a bound on the MSE convergence rate for the Friedman-Rafsky estimator of the Henze-Penrose divergence assuming the densities are sufficiently smooth. We employed a partitioning strategy to derive the bias rate, which depends on the number of partitions, the sample size , the Hölder smoothness parameter , and the dimension . However, by using the optimal partition number, we derived the MSE convergence rate only in terms of , , and . We validated our proposed MSE convergence rate using simulations and illustrated the approach for the meta-learning problem of estimating the HP-divergence for three real-world data sets. We also provided concentration bounds around the median and mean of the estimator. These bounds explicitly provide the rate at which the FR statistic approaches its median/mean with high probability, not only as a function of the number of samples, , , but also in terms of the dimension of the space . By using these results we explored the asymptotic behavior of a variance-like rate in terms of , , and .
Appendix A Proof of Theorem 4
In this section, we prove subadditivity and superadditivity for the mean of the FR test statistic. For this, we first need the following lemma.
Let be a uniform partition of into subcubes with edges parallel to the coordinate axes having edge lengths and volumes . Let be the set of edges of MST graph between and with cardinality , then for defined as the sum of for all , , we have , or more explicitly
where is the Hölder smoothness parameter and
Here and in what follows, denote the length of the shortest spanning tree on , namely
where the minimum is over all spanning trees of the vertex set . Using the subadditivity relation for in , with the uniform partition of into subcubes with edges parallel to the coordinate axes having edge lengths and volumes , we have
where is constant. Denote the set of all edges of which intersect two different subcubes and with cardinality . Let be the length of -th edge in set . We can write
also we know that
Note that using the result from (, Proposition 3), for some constants and , we have
Now let and , hence we can bound the expectation (17) as
where is defined as in Lemma 1. Let and be the number of sample and respectively falling into the partition , such that and . Introduce sets and as
Since set has fewer edges than set , (19) implies that the difference set of and contains at most edges, where is the number of edges in . In other words,
The number of edges linking nodes from different samples in set is bounded by the number of edges linking nodes from different samples in set plus :
Here denotes the number of edges linking nodes from different samples in partition , . Next, we refer the reader to Lemma 1, where it is shown that there is a constant such that . This establishes the claimed assertion (7). To complete the proof, the lower bound in (8) is obtained by a similar argument and the set inclusion:
This completes the proof.
Appendix B Proof of Theorem 2
As with many continuous subadditive functionals on , for the FR statistic there exists a dual superadditive functional based on the dual MST, , proposed in Definition 2. Note that in the MST* graph, the degrees of the corner points are bounded by , where depends only on the dimension , and is the bound on the degree of every node in the MST graph. The following properties hold for the dual FR test statistic, :
Given samples and , the following inequalities hold true:
For constant which depends on :
(Subadditivity on and Superadditivity) Partition into subcubes such that , are the numbers of samples from and , respectively, falling into the partition with dual . Then we have
where is a constant.
(i) Consider the nodes connected to the corner points. Since and can differ only in the edges connected to these nodes, and in we count all of the edges between these nodes and the corner nodes, the second relation in (22) follows. For the first inequality in (22), it suffices to note that the total number of edges connected to the corner nodes is upper bounded by .
(ii) Let be the set of edges of the graph that intersect two different partitions. The MST and differ only in the edges of points connected to the corners and in the edges crossing different partitions; therefore . When one edge in set is eliminated, in the worst case there are two possibilities: either the corresponding node is connected to a corner, in which case the edge is counted anyway, or it is connected to another point in the MST graph, which does not change the FR test statistic. This implies the following subadditivity relation:
Further, from Lemma 1 we know that there is a constant such that . Hence the first inequality in (23) is obtained. Next, consider , which represents the total number of edges from both samples connected only to the corner points in the graph. Therefore one can claim:
We also know that , where denotes the largest possible degree of any vertex. One can write
Let be a density function with support and belonging to the strong Hölder class , , stated in Definition 1. Also, assume that is a -Hölder smooth function such that its absolute value is bounded from above by a constant. Define the quantized density function with parameter and constants as
Let and . Then
Denote by the degree of vertex in the over the set with vertices. For a given function , one obtains
where for constant ,
Assume that for given , is a bounded function belonging to . Let be a symmetric, smooth, jointly measurable function, such that, given , for almost every , is measurable with a Lebesgue point of the function . Assume that the first derivative is bounded. For each , let