Kernel Two-Sample Hypothesis Testing Using Kernel Set Classification
The two-sample hypothesis testing problem is studied for the challenging scenario of high dimensional data sets with small sample sizes. We show that the two-sample hypothesis testing problem can be posed as a one-class set classification problem. In the set classification problem the goal is to classify a set of data points that are assumed to have a common class. We prove that the average probability of error given a set is less than or equal to the Bayes error and decreases as a power of the number of sample data points in the set. We use the positive definite Set Kernel for directly mapping sets of data to an associated Reproducing Kernel Hilbert Space, without the need to learn a probability distribution. We specifically solve the two-sample hypothesis testing problem using a one-class SVM in conjunction with the proposed Set Kernel. We compare the proposed method with the Maximum Mean Discrepancy, F-Test and T-Test methods on a number of challenging simulated high dimensional data sets with small sample sizes. We also perform two-sample hypothesis testing experiments on six cancer gene expression data sets and achieve zero type-I and type-II error results on all data sets.
Set Classification, Positive Definite Kernel, Two-Sample Hypothesis Testing, One-Class Classification, Maximum Mean Discrepancy, Gene Expression Data
Many problems are naturally in the form of a set classification problem which is defined as classifying a set of data points given that all the data points in the set belong to the same unknown class [KernelForSets, SetImageNonParamModel, SetClass]. In other words, in a set classification problem we classify a set of data vectors rather than a single vector. For example the pixels of an image can be thought of as a set and classifying the image can be thought of as classifying the set of pixels in the image [KernelForSets]. As another example, face recognition from multiple images of the same person can be posed as a set classification problem where the set of multiple images must be assigned to a certain individual [FaceSetLongTerm, FaceSet, FaceRecogSetsBased]. Many other problems such as gene expression or chemical classification, document classification, ontology alignment, scene classification, video classification and multiple pose object recognition can be naturally posed as set classification problems [SetVideo, ChemSet2, ChemSet, SetClass, SparseImageSet, book:Ontology, ImageSet].
There are generally two different approaches to the set classification problem. The first approach uses a standard classifier on the individual elements of the set and then applies a variety of voting schemes to reach a consensus on the entire set [SetClass]. In this paper we formally prove in Section V that this approach is suboptimal. Intuitively, these types of methods do not make full use of all the information available and ignore the inter-dependencies between the elements of the set. They classify each element independently as opposed to using all of the data points concurrently to learn a class for the entire set of samples.
The second approach is to somehow summarize the set of data points into a single entity and then make a classification decision based on this single entity. For example, a simple approach could be to summarize a set of vectors into an average vector and classify the average vector. A more advanced approach could be to summarize the set of vectors into a specialized probability distribution and then make a decision on the probability distribution [KernelForSets, SetParamModel1, SetParamModel3, SetParamModel4, SetFaceParamModel] or to use kernel or nonparametric methods that directly measure distances between sets [SetDistances, SetImageNonParamModel, SetImageNonParamModel2, FaceRecogSetsBased, SparseImageSet]. These methods can suffer from a few deficiencies. Namely, estimating a probability distribution is generally problematic in high dimensional spaces and many of the proposed kernels are not positive definite.
We take the second approach while avoiding its pitfalls. Specifically, we use a method where each set of vectors of any size is mapped directly to a vector in a Reproducing Kernel Hilbert Space (RKHS), without the need to model the distribution. Notably, the Set Kernel associated with this RKHS is proven to be a positive definite inner product kernel and the norm associated with this inner product kernel is the empirical Maximum Mean Discrepancy (MMD) [TwoMMD, MMD]. The theoretical properties of the MMD have been extensively studied in [MMD] and the empirical MMD has been justified using performance guarantees. Rather than looking at the empirical MMD as an approximation to the MMD, it can be independently justified as a norm in a certain RKHS of sets.
The MMD has found many recent applications in diverse fields ranging from image analysis [MMDImage] and class ratio estimation [MMDClassRatio] to nonparametric scoring rules [MMDScoreRule]. Nevertheless, it was initially introduced as a nonparametric two-sample hypothesis test [MMD]. Traditional parametric hypothesis tests such as the T-Test and F-Test [book:TTEST] are not suited for high dimensional data because of poor estimation in high dimensional spaces. The nonparametric MMD method, on the other hand, was proven to have the injective property [MMD] and shown to have a significant performance advantage in high dimensional data problems. For example, it was shown to significantly outperform traditional hypothesis testing methods on gene expression data which are high dimensional in nature [MMD].
In light of the alternative way of looking at the empirical MMD as a norm or distance measure between sets in an RKHS, the hypothesis testing method based on the empirical MMD can be thought of as a primitive one-class classification problem. Specifically, this hypothesis test is based on learning a threshold for the distances between different sets in the training set. It then tests a set by measuring the distance between it and a training set. The test set is rejected if this distance is above the threshold.
Here we suggest that the above primitive method can be improved using more advanced one-class classifiers [OneClassPhd] given a well defined RKHS for sets. Rather than learn a single threshold on the distances between sets, we learn a one-class SVM [OneClassSVM] on the RKHS of sets using the associated Set Kernel. The training data sets are used to learn the one-class SVM boundary and a test set is rejected if it falls outside this boundary in the RKHS.
In the experimental section we show that this novel approach to the two-sample hypothesis problem, using the one-class SVM with Set Kernels, leads to state of the art results on challenging simulated and real world data sets. We consider multivariate Gaussian classes with equal means and different variances in various low and high dimensional spaces which are challenging for both the F-Test and MMD methods. We also consider various gene expression data sets and show that the one-class SVM with Set Kernels method has perfect performance on these high dimensional data sets.
The paper is organized as follows. In Sections II and III we review some background material on Reproducing Kernel Hilbert Spaces and kernel two-sample hypothesis testing using the MMD. Next, in Section IV, we establish the RKHS for sets by considering the Set Kernel and establishing that it is positive definite. In Section V we motivate the use of a set classifier by proving that the average probability of error decreases as we use a larger set of sample data points for classification. The two-sample hypothesis test method using the one-class SVM with Set Kernels is explained in Section VI. Finally, the two-sample hypothesis test experimental results are presented in Section VII using both simulated data and real world gene expression data.
II Background on Reproducing Kernel Hilbert Spaces
Data samples in the input space, $\mathcal{X}$, are typically mapped to a higher dimensional feature space for improved separation between the classes. The kernel function can be viewed as an efficient way of computing inner products in this high dimensional feature space [book:RHKS, Sriperumbudur2010]. For a given mapping $\phi: \mathcal{X} \to \mathcal{H}$, the kernel function $k$ allows us to compute the inner product between two vectors in the feature space without having to explicitly know the mapping $\phi$, in the form of
$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}.$$
If a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ happens to be positive definite, meaning that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \geq 0$$
for any $n \in \mathbb{N}$, any choice of $x_1, \dots, x_n \in \mathcal{X}$ and any coefficients $c_1, \dots, c_n \in \mathbb{R}$, then there exists a Hilbert space $\mathcal{H}$ and a mapping $\phi: \mathcal{X} \to \mathcal{H}$ such that $k$ computes the inner product in that space. In other words, we can write a positive definite function in the form of an inner product and conversely if a function can be written as an inner product it is positive definite.
Furthermore, there is a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ associated with every positive definite kernel $k$ such that
$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H},$$
where the mapping can be written as $\phi(x) = k(\cdot, x)$ and $k(\cdot, x)$ is the positive definite kernel function parametrized by $x$.
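The positive definiteness condition can be checked numerically on a finite sample. The following is a minimal sketch, assuming a Gaussian base kernel and a small random sample; the function name `gaussian_kernel`, the bandwidth `sigma` and the sample sizes are illustrative choices, not values taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))          # 8 hypothetical points in R^3
G = gaussian_kernel(x, x)            # 8 x 8 Gram matrix

# Positive definiteness on this sample: c^T G c >= 0 for any coefficients c,
# which is equivalent to all eigenvalues of the symmetric matrix G being >= 0.
c = rng.normal(size=8)
quad_form = float(c @ G @ c)
min_eig = float(np.linalg.eigvalsh(G).min())
print(quad_form >= 0.0, min_eig >= -1e-10)
```

A check of this kind only probes one finite sample, so it can reveal that a candidate kernel is not positive definite but cannot prove that it is.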
III Background on Kernel Two Sample Hypothesis Testing Using the MMD
The two sample hypothesis test consists of answering the question of whether we can distinguish between two probability distributions $p$ and $q$ given only two sets of independent and identically distributed (i.i.d.) samples $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ drawn from $p$ and $q$ respectively. The first approach that comes to mind in solving this problem might be to choose a model and estimate the parameters of the model using the given data sets. Such traditional methods generally do not work well in high dimensional data spaces with limited data because of poor estimation properties in high dimensions.
The Maximum Mean Discrepancy (MMD) [MMD] is a nonparametric method for dealing with this problem in a Reproducing Kernel Hilbert Space and can be explained as follows. First we define the mean embedding $\mu_p$ of the distribution $p$ to be the expectation under $p$ of the mapping $\phi(x)$, which can also be written as
$$\mu_p = \mathbb{E}_{x \sim p}[\phi(x)] = \mathbb{E}_{x \sim p}[k(\cdot, x)].$$
The maximum mean discrepancy (MMD) of the two embedded distributions $p$ and $q$ is now expressed as the squared difference between their respective embedded means $\mu_p$ and $\mu_q$ as
$$\mathrm{MMD}(p, q) = \| \mu_p - \mu_q \|_{\mathcal{H}}^2.$$
It can be shown that under certain non-restricting conditions on the RKHS, $\mathrm{MMD}(p, q) = 0$ if and only if $p = q$. In other words the MMD is injective [MMD].
Given two i.i.d. sets $X$ and $Y$ sampled from $p$ and $q$, the empirical MMD can be readily derived using the empirical mean embeddings as
$$\widehat{\mathrm{MMD}}(X, Y) = \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\|_{\mathcal{H}}^2 = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{i'=1}^{m} k(x_i, x_{i'}) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{j'=1}^{n} k(y_j, y_{j'})$$
and shown to have favorable performance guarantees. Along with the injective property, the empirical MMD can be used to deal with the two sample hypothesis test problem by checking to see if the empirical MMD is less than a learned threshold. If the MMD of the two sets $X$ and $Y$ is below the threshold and thus sufficiently close to zero, then $X$ and $Y$ are likely to have been sampled from the same distribution and we conclude that $p = q$. This approach makes no model assumptions, is nonparametric, and has been shown to have state of the art performance on high dimensional data sets with limited data [TwoMMD, MMD].
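The empirical MMD described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it uses a Gaussian base kernel, the biased estimate built from averaged Gram matrices, and hypothetical sample sizes, dimension and bandwidth.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def empirical_mmd(X, Y, sigma=1.0):
    """Squared RKHS distance between the empirical mean embeddings of X and Y."""
    kxx = gaussian_kernel(X, X, sigma).mean()   # (1/m^2) sum_ii' k(x_i, x_i')
    kyy = gaussian_kernel(Y, Y, sigma).mean()   # (1/n^2) sum_jj' k(y_j, y_j')
    kxy = gaussian_kernel(X, Y, sigma).mean()   # (1/mn)  sum_ij  k(x_i, y_j)
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 5))          # hypothetical sample from p
Y_same = rng.normal(0.0, 1.0, size=(50, 5))     # second sample from p
Y_diff = rng.normal(2.0, 1.0, size=(50, 5))     # sample from a mean-shifted q

print(empirical_mmd(X, Y_same), empirical_mmd(X, Y_diff))
```

The statistic is exactly zero only when a set is compared with itself; for two finite samples from the same distribution it is small but positive, which is why a threshold has to be learned.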
IV An RKHS for Sets
In this section we construct an RKHS for sets of vectors. In later sections we will use this RKHS to deal with the two sample hypothesis test problem in a more effective manner.
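We define the Set-Kernel on two sets of vectors $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$, where $x_i, y_j \in \mathcal{X}$, as
$$K(X, Y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) \tag{9}$$
where $k$ is a positive definite kernel with associated mapping $\phi$. Also, the associated Set-Mapping is defined as
$$\Phi(X) = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i).$$
The Set-Kernel is a valid inner product kernel and thus positive definite. To show this we first write the Set-Kernel in the form of an inner product as
$$K(X, Y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \langle \phi(x_i), \phi(y_j) \rangle = \left\langle \frac{1}{m} \sum_{i=1}^{m} \phi(x_i), \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\rangle = \langle \Phi(X), \Phi(Y) \rangle.$$
The positive definiteness of the Set-Kernel is now established by noting that it is the sum of positive definite kernels and so, for any sets $X_1, \dots, X_N$ and coefficients $c_1, \dots, c_N \in \mathbb{R}$, we write
$$\sum_{a=1}^{N} \sum_{b=1}^{N} c_a c_b \, K(X_a, X_b) = \left\| \sum_{a=1}^{N} c_a \Phi(X_a) \right\|^2 \geq 0.$$
The Set-Kernel is similar to the Derived Subset Kernel defined in [book:KernelMethodsforPatternAnalysis] and exactly the same as those proposed in [SetKernelOrigin, SetKernelOrigin2]. This specific definition allows for a direct connection to the empirical MMD as follows. First, we note that the induced norm in this RKHS for sets can be written as
$$\| \Phi(X) \|^2 = \langle \Phi(X), \Phi(X) \rangle = K(X, X) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{i'=1}^{m} k(x_i, x_{i'}). \tag{16}$$
The empirical MMD is now equal to the squared distance between two vectors in the RKHS for sets. Formally, let $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$, where $x_i, y_j \in \mathcal{X}$, be two independent and identically distributed sets sampled from the two respective distributions $p$ and $q$. Also, let $k$ be a positive definite kernel with associated mapping $\phi$ and $K$ be the Set-Kernel of (9) with associated Set-Mapping $\Phi$, then
$$\widehat{\mathrm{MMD}}(X, Y) = \| \Phi(X) - \Phi(Y) \|^2.$$
We can prove the above equality by writing
$$\| \Phi(X) - \Phi(Y) \|^2 = K(X, X) - 2 K(X, Y) + K(Y, Y) = \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\|^2 = \widehat{\mathrm{MMD}}(X, Y). \tag{18}$$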
Figure 1 visualizes the connections between the empirical MMD and the Set-Kernel mapping.
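The relation between the Set-Kernel, the Set-Mapping and the distance between sets can be verified numerically when the base kernel has an explicit, finite dimensional feature map. The sketch below assumes the homogeneous quadratic kernel, chosen only because its feature map can be written out; it is not the Gaussian base kernel used in the experiments, and all function names are illustrative.

```python
import numpy as np

def base_kernel(A, B):
    """Homogeneous quadratic kernel k(x, y) = (x . y)^2."""
    return (A @ B.T) ** 2

def phi(A):
    """Explicit feature map of the quadratic kernel: phi(x) = vec(x x^T)."""
    return np.einsum("ni,nj->nij", A, A).reshape(len(A), -1)

def set_kernel(X, Y):
    """Set-Kernel: average of the base kernel over all cross pairs."""
    return base_kernel(X, Y).mean()

def set_mapping(X):
    """Set-Mapping: average of the mapped elements of the set."""
    return phi(X).mean(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 4))      # the two sets may have different sizes
Y = rng.normal(size=(12, 4))

# Inner product of the mapped sets versus the Set-Kernel value.
inner = set_mapping(X) @ set_mapping(Y)

# Squared distance computed explicitly and through Set-Kernel evaluations only.
dist_sq_explicit = np.sum((set_mapping(X) - set_mapping(Y)) ** 2)
dist_sq_kernel = set_kernel(X, X) - 2.0 * set_kernel(X, Y) + set_kernel(Y, Y)

print(np.isclose(inner, set_kernel(X, Y)), np.isclose(dist_sq_explicit, dist_sq_kernel))
```

With a Gaussian base kernel the feature map is infinite dimensional, so only the kernel-based expression on the last line is available in practice.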
In summary, we now have an RKHS where the inner product between sets is defined by (9), the norm of a set is defined by (16) and the distance between sets is defined by (18). Any kernelized classification algorithm based on computing inner products, norms or distances can now be applied to sets of vectors.
In the next section we first prove the interesting result that the average probability of error for classifying a set is less than or equal to the Bayes error associated with the problem. We then specifically solve the two sample hypothesis testing problem by treating it as a one-class classification problem in the proposed RKHS for sets.
V The Average Probability of Error for Classifying a Set
A classifier is a mapping from a feature vector to a class label . Class labels and feature vectors are sampled from the probability distributions and respectively. If we only have a single data point , the probability of error given is
The average probability of error for a single point is known as the Bayes Error or Bayes Risk and can be derived as
where we have assumed equal priors. The first term is known as the miss rate (type-I error) and the second term is known as the false positive rate (type-II error). It is well known that for fixed data distributions and , an average probability of error that is less than the Bayes Error is not possible. Note that this is all conditioned on the assumption that we are basing our decisions on a single data point .
Interestingly, we show that the Bayes Error can be beaten if we base our decisions on a set of data points. Specifically, if we have a set of data points that are all identically and independently sampled from the same class, then the probability of error (misclassifying all points) given is
where is the probability that all points were sampled from class .
Using the chain rule of probability and the fact that are all sampled independently we can write
We can now derive the average probability of error for a set which we denote as the Set Bayes Error
Since and are less than or equal to one then
The above result is intuitive since it states that the average probability of error is lower if we base our decision on a set of data samples rather than a single sample data point. It also states that the average probability of error decreases as a power of the number of sample data points. This important result serves as a motivation for the next section in which we define the hypothesis testing problem as a set classification problem.
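A toy calculation illustrates how the error shrinks with the set size. This is not the derivation above but a separate worked example under explicit assumptions: two one dimensional Gaussian classes N(0, 1) and N(delta, 1) with equal priors, and a set of n i.i.d. points classified with the likelihood ratio rule, which here reduces to thresholding the sample mean at delta / 2. The names `set_error` and `delta` are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def set_error(n, delta=1.0):
    """Error of the likelihood ratio rule on n i.i.d. points, N(0,1) vs N(delta,1).

    The sample mean has standard deviation 1/sqrt(n), so the error equals
    Phi(-delta * sqrt(n) / 2). For n = 1 this is the single-point Bayes error.
    """
    return normal_cdf(-delta * math.sqrt(n) / 2.0)

errors = {n: set_error(n) for n in (1, 2, 5, 10, 20)}
for n, e in errors.items():
    print(f"n = {n:2d}  error = {e:.4f}")
```

In this example the single-point error is about 0.31 and the error decreases monotonically as more points from the same class are pooled into the decision.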
VI The Two Sample Hypothesis Test as a One-Class Set Classification Problem
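In a two sample hypothesis test problem we are only provided with a set of samples $X$ from a distribution $p$. This is the only training data we have and we typically do not get to see any samples from any other alternative distribution. During testing we are presented with a set $Y$ and we must decide if $Y$ consists of samples from the distribution $p$ or not.
In view of the RKHS for sets, the two sample hypothesis test problem can now be treated as a one-class classification problem. Specifically, we need to classify $Y$ and decide if it belongs to class $p$ or not. The training data consists of non-empty subsets of $X$, such as $X_1$, $X_2$, …, $X_N$, which are each considered a vector in the RKHS of sets. Any kernelized one-class classification algorithm can learn the reject region in the RKHS of sets and decide if a test vector is in the reject region or not. In this paper we specifically use the effective one-class SVM described in [OneClassSVM] which solves the following optimization problem
$$\min_{w, \xi, \rho} \; \frac{1}{2} \| w \|^2 + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i - \rho \quad \text{subject to} \quad \langle w, \Phi(X_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, N,$$
where $\nu \in (0, 1]$ is the SVM parameter, $\xi_i$ are slack variables and $\rho$ is the offset.
The associated dual problem is
$$\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(X_i, X_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{\nu N}, \quad \sum_{i=1}^{N} \alpha_i = 1.$$
The final decision function applied to a test set $Y$ is derived from the dual problem and can be written as
$$f(Y) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i \langle \Phi(X_i), \Phi(Y) \rangle - \rho \right) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i K(X_i, Y) - \rho \right)$$
where $\alpha_i$ are the dual parameters and $K$ is the Set-Kernel with associated Set-Mapping $\Phi$. Two sample hypothesis testing using one-class SVM and Set-Kernels is summarized in Algorithm-1.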
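The procedure of this section can be sketched with a precomputed Set-Kernel Gram matrix. This is a minimal illustration under assumptions, not a reproduction of Algorithm-1: it uses scikit-learn's `OneClassSVM` (which wraps LibSVM) instead of calling the LibSVM toolbox directly, and the dimension, set size, number of subsets, bandwidth `sigma` and `nu` are hypothetical values.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def gaussian_kernel(A, B, sigma):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def set_kernel(X, Y, sigma):
    """Set-Kernel: average base kernel value over all cross pairs."""
    return gaussian_kernel(X, Y, sigma).mean()

def random_subsets(data, n_sets, set_size, rng):
    """Draw n_sets random subsets of fixed size (no repeats within a subset)."""
    return [data[rng.choice(len(data), size=set_size, replace=False)]
            for _ in range(n_sets)]

def gram(sets_a, sets_b, sigma):
    """Matrix of Set-Kernel values between two lists of sets."""
    return np.array([[set_kernel(a, b, sigma) for b in sets_b] for a in sets_a])

rng = np.random.default_rng(0)
dim, sigma = 10, 3.0                                   # hypothetical settings
train = rng.normal(0.0, 1.0, size=(200, dim))          # samples from p only
train_sets = random_subsets(train, n_sets=60, set_size=20, rng=rng)

K_train = gram(train_sets, train_sets, sigma)          # shape (60, 60)
clf = OneClassSVM(kernel="precomputed", nu=0.1).fit(K_train)

# Test sets: rows of the test Gram matrix are test sets, columns are training sets.
test_same = random_subsets(rng.normal(0.0, 1.0, size=(200, dim)), 30, 20, rng)
test_diff = random_subsets(rng.normal(0.0, 2.0, size=(200, dim)), 30, 20, rng)
pred_same = clf.predict(gram(test_same, train_sets, sigma))   # +1 accept, -1 reject
pred_diff = clf.predict(gram(test_diff, train_sets, sigma))
print((pred_same == -1).mean(), (pred_diff == 1).mean())       # empirical error rates
```

The two printed fractions are the empirical type-I and type-II error rates for this toy setup; in practice `sigma` and `nu` would need to be selected, for example by cross validation.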
While the MMD is a state of the art method for hypothesis testing, its shortcoming is evident when viewed from the standpoint of Set-Kernels. It trains by measuring the distances between different data points in the RKHS of sets and learns a single threshold on these measured distances. If a test set has a distance from the training data set that is above the learned threshold then it is rejected. This is a primitive classification method when compared to the one-class SVM with Set-Kernels. In the next section we compare the MMD and proposed method on different simulated and real world data sets.
VII Experimental Results
In this section we perform two sample hypothesis tests on both simulated Gaussian data and benchmark cancer gene expression data sets, and compare the performance of the MMD, F-Test and T-Test methods against the proposed one-class SVM with Set-Kernels method.
VII-A Simulated Gaussian Data Sets
Two sets of experiments are reported on simulated Gaussian data. In the first set of experiments we simulated data from two Gaussians of equal means and different variances on a range of different dimensions. We used two Gaussians, and , where we fixed and and varied the dimensions of the two Gaussians over Dim. We sampled points from the training distribution of which were used for training and the other were used for testing the type-I error. Also, another points were sampled from and used to test the type-II error. Distinguishing between the two distributions of and , that only differ slightly in variance, is generally considered a challenging problem especially with such limited training samples.
The MMD reject threshold was found using the standard procedure from bootstrap iterations for a fixed type-I error of . We used the Gaussian embedding as the base kernel in all our experiments and the Gaussian kernel parameter was found using the median heuristic of [MMD]. The standard two sample F-test for equal variances was also performed for a fixed type-I error of . Finally, the one-class SVM was trained using the LibSVM toolbox [LIBSVM] with and precomputed Set-Kernels with random subsets of fixed set size. The base kernel was a Gaussian with fixed kernel parameter equal to and the SVM threshold parameter was found from cross validation. Finally, a fixed training and testing set size of elements each was used for all methods.
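The MMD baseline setup can be sketched as follows. The details of the standard procedure are not spelled out in the text, so this is one plausible reading under assumptions: the kernel width is set to the median pairwise distance of the training points, and the reject threshold is the (1 - alpha) quantile of the empirical MMD between random disjoint subsets of the training data. The values of `alpha`, `n_boot`, the set size and the data dimensions are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Matrix of squared Euclidean distances between rows of A and rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)

def median_heuristic(Z):
    """Kernel width set to the median distance between distinct points of Z."""
    d = np.sqrt(pairwise_sq_dists(Z, Z))
    return float(np.median(d[np.triu_indices(len(Z), k=1)]))

def empirical_mmd(X, Y, sigma):
    """Biased empirical MMD (squared RKHS distance) with a Gaussian kernel."""
    def mean_k(A, B):
        return np.exp(-pairwise_sq_dists(A, B) / (2.0 * sigma**2)).mean()
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)

def bootstrap_threshold(train, set_size, sigma, alpha, n_boot, rng):
    """(1 - alpha) quantile of the MMD between random same-distribution set pairs."""
    stats = []
    for _ in range(n_boot):
        idx = rng.permutation(len(train))[: 2 * set_size]
        stats.append(empirical_mmd(train[idx[:set_size]], train[idx[set_size:]], sigma))
    return float(np.quantile(stats, 1.0 - alpha))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(100, 20))           # hypothetical training data
sigma = median_heuristic(train)
threshold = bootstrap_threshold(train, set_size=25, sigma=sigma,
                                alpha=0.05, n_boot=200, rng=rng)

# Test a set from a hypothetical alternative with larger variance.
test_set = rng.normal(0.0, 1.5, size=(25, 20))
reference = train[rng.choice(len(train), size=25, replace=False)]
reject = empirical_mmd(reference, test_set, sigma) > threshold
print(sigma, threshold, reject)
```

Because the threshold is a quantile of same-distribution statistics, roughly a fraction alpha of sets drawn from the training distribution would be rejected by construction.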
The type-I and type-II error test results averaged over repetitions are reported in Table I for the MMD, F-Test and one-class SVM with Set-Kernels (SVM+SetKernel) methods. To better visualize the contrast in performance between the three methods we have also plotted the sum of both type-I and type-II errors as Total Error over different dimensions in Figure 2.
The one-class SVM with Set Kernels has markedly smaller type-II error at about the same type-I error for all dimensions, which results in significantly lower Total Error as seen in Figure 2. While the risk associated with the problem decreases at higher dimensions, the F-Test does worse at dimensions higher than which emphasizes its inability to deal with high dimensional data. The MMD has the opposite problem of doing poorly on low dimensional data sets [MMDBio], yet still has markedly higher error than the one-class SVM with Set Kernels even in high dimensions.
In the second set of experiments we repeated the above under the same conditions but considered a much more difficult problem where we fixed and . The type-I and type-II error test results for repetitions are reported in Table II for the MMD, F-Test and one-class SVM with Set-Kernels (SVM+SetKernel) methods. We have also plotted the sum of both type-I and type-II errors as Total Error over different dimensions in Figure 3.
The one-class SVM with Set Kernels has significantly smaller total error at all dimensions. The F-Test, which is based on estimation techniques, completely breaks down in high dimensions, while the MMD method also performs poorly at all dimensions because it generally has problems in dealing with data sets of equal means and different variances.
VII-B Benchmark Cancer Gene Data Sets
Next we performed two sample hypothesis tests on six high dimensional benchmark gene expression data sets downloaded from [GeneWebsite]. The data sets are challenging because of their small sample size and high dimensions, the details of which are provided in Table III. The experiments involved splitting the positive samples into a train set and a leave out set for testing the type-I error. The number of positive train samples, positive leave out samples, test negative samples and fixed set size used in each experiment are also reported in Table III. The train set was used to learn the MMD reject thresholds from bootstrap iterations for a fixed type-I error of . We again used the Gaussian kernel and the Gaussian kernel parameter was found using the median heuristic of [MMD]. The standard two sample T-Test was also performed for a fixed type-I error of . The one-class SVM was trained using the LibSVM toolbox [LIBSVM] with and precomputed Set-Kernels with random subsets of fixed set size detailed in Table III for each data set. The base kernel was a Gaussian with fixed kernel parameter equal to .
The type-I and type-II error test results averaged over repetitions are reported in Table IV for the MMD, T-Test and one-class SVM with Set-Kernels (SVM+SetKernel) methods. The one-class SVM with Set-Kernels method has zero type-I and type-II error on all data sets while the MMD method has considerably higher error rates on all data sets. The T-Test method completely fails on these data sets. This is not surprising when considering the fact that these are generally very high dimensional data sets in the range of , and dimensions as seen from Table III. Any method based on estimation techniques, such as the T-Test, will fail in such high dimensions, while these types of data sets are easily separable in such high dimensions for a one-class SVM with appropriately chosen Set-Kernels.
|Number||Data Set||# Train Pos.||# Leave Out Pos.||# Test Neg.||Set Size||# Dimensions|
|#1||Lung Cancer Women’s Hospital||21||10||150||7||12533|
|#3||Lymphoma Harvard Outcome||17||9||32||6||7129|
|#5||Central Nervous System Tumor||14||7||39||4||7129|
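[Table IV: type-I and type-II errors of the MMD, T-Test and SVM+SetKernel methods on the gene expression data sets, including Lung Cancer Women’s Hospital, Lymphoma Harvard Outcome and Central Nervous System Tumor.]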
In this work we framed the two sample hypothesis test problem as a one-class classification problem in an appropriate RKHS on sets. We showed how to map a set into this RKHS using the provably positive definite Set Kernel. Interestingly, the empirical MMD is the induced norm in this RKHS. Under this view, the MMD method for hypothesis testing can be seen as placing a simple threshold on the distances between training sets. We proved that the average probability of error for classifying a set of data samples decreases as a power of the number of samples and proposed to use the effective one-class SVM classifier to perform the hypothesis test. This is made possible by the appropriately defined positive definite Set-Kernel.
Unlike most traditional hypothesis testing methods such as the F-Test and T-Test, the proposed method is nonparametric, meaning that it does not attempt to estimate the parameters of a probability distribution. This makes the proposed method suitable for applications with limited high dimensional data. Also, unlike the MMD based method, the proposed method uses the one-class SVM classifier and learns a nonlinear decision surface on the data rather than a single threshold. This gives the proposed method much higher discriminating ability and classification accuracy, resulting in lower type-I and type-II errors.
We tested the proposed method on a number of data sets that were designed to evaluate different challenging scenarios. We first considered Gaussian data sets of equal mean and different variance with a small number of training samples. This is challenging for the F-Test because of the small size of the training set and it is challenging for the MMD method because the data has equal mean and different variance. The proposed method significantly outperformed both the F-Test and MMD given that the one-class SVM with the Set Kernel can learn complicated decision surfaces with limited data in both low and high dimensions.
Finally, we considered six real world high dimensional gene expression data sets with small sample sizes. The T-Test completely failed on these high dimensional data sets and the MMD had suboptimal performance. On the other hand, the one-class SVM with Set Kernels had ideal performance with zero type-I and zero type-II error on all data sets.