Many contemporary statistical learning methods assume a Euclidean feature space. This paper presents a method for defining similarity based on hyperspherical geometry and shows that it often improves the performance of support vector machine compared to other competing similarity measures. Specifically, the idea of using heat diffusion on a hypersphere to measure similarity has been previously proposed and tested by , demonstrating promising results based on a heuristic heat kernel obtained from the zeroth order parametrix expansion; however, how well this heuristic kernel agrees with the exact hyperspherical heat kernel remains unknown. This paper presents a higher order parametrix expansion of the heat kernel on a unit hypersphere and discusses several problems associated with this expansion method. We then compare the heuristic kernel with an exact form of the heat kernel expressed in terms of a uniformly and absolutely convergent series in high-dimensional angular momentum eigenmodes. Being a natural measure of similarity between sample points dwelling on a hypersphere, the exact kernel often shows superior performance in kernel SVM classifications applied to text mining, tumor somatic mutation imputation, and stock market analysis.
Exact heat kernel on a hypersphere and
its applications in kernel SVM
Chenchao Zhao , and Jun S. Song
Department of Physics,
University of Illinois at Urbana-Champaign,
Urbana, IL 61801, USA
Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
As the techniques for analyzing large data sets continue to grow, diverse quantitative sciences – including computational biology, observation astronomy, and high energy physics – are becoming increasingly data driven. Moreover, modern business decision making critically depends on quantitative analyses such as community detection and consumer behavior prediction. Consequently, statistical learning has become an indispensable tool for modern data analysis. Data acquired from various experiments are usually organized into an matrix, where the number of features typically far exceeds the number of samples. In this view, the samples, corresponding to the columns of the data matrix, are naturally interpreted as points in a high-dimensional feature space . Traditional statistical modeling approaches often lose their power when the feature dimension is high. To ameliorate this problem, Lafferty and Lebanon proposed a multinomial interpretation of non-negative feature vectors and an accompanying transformation of the multinomial simplex to a hypersphere, demonstrating that using the heat kernel on this hypersphere may improve the performance of kernel support vector machine (SVM) [1, 2, 3, 4, 5, 6, 7]. Despite the interest that this idea has attracted, only approximate heat kernel is known to date. We here present an exact form of the heat kernel on a hypersphere of arbitrary dimension and study its performance in kernel SVM classifications of text mining, genomic, and stock price data sets.
To date, sparse data clouds have been extensively analyzed in the flat Euclidean space endowed with the -norm using traditional statistical learning algorithms, including KMeans, hierarchical clustering, SVM, and neural network [2, 8, 3, 4, 5, 6, 7]; however, the flat geometry of the Euclidean space often poses severe challenges in clustering and classification problems when the data clouds take non-trivial geometric shapes or class labels are spatially mixed. Manifold learning and kernel-based embedding methods attempt to address these challenges by estimating the intrinsic geometry of a putative submanifold from which the data points were sampled and by embedding the data into an abstract Hilbert space using a nonlinear map implicitly induced by the chosen kernel, respectively [9, 10, 11]. The geometry of these curved spaces may then provide novel information about the structure and organization of original data points.
Heat equation on the data submanifold or transformed feature space offers an especially attractive idea of measuring similarity between data points by using the physical model of diffusion of relatedness (“heat”) on curved space, where the diffusion process is driven by the intrinsic geometry of the underlying space. Even though such diffusion process has been successfully approximated as a discrete-time, discrete-space random walk on complex networks, its continuous formulation is rarely analytically solvable and usually requires complicated asymptotic expansion techniques from differential geometry . An analytic solution, if available, would thus provide a valuable opportunity for comparing its performance with approximate asymptotic solutions and rigorously testing the power of heat diffusion for geometric data analysis.
Given that a Riemannian manifold of dimension is locally homeomorphic to , and that the heat kernel is a solution to the heat equation with a point source initial condition, one may assume in the short diffusion time limit () that most of the heat is localized within the vicinity of the initial point and that the heat kernel on a Riemannian manifold locally resembles the Euclidean heat kernel. This idea forms the motivation behind the parametrix expansion, where the heat kernel in curved space is approximated as a product of the Euclidean heat kernel in normal coordinates and an asymptotic series involving the diffusion time and normal coordinates. In particular, for a unit hypersphere, the parametrix expansion in the limit involves a modified Euclidean heat kernel with the Euclidean distance replaced by the geodesic arc length . Computing this parametrix expansion is, however, technically challenging; even when the computation is tractable, applying the approximation directly to high-dimensional clustering and classification problems may have limitations. For example, in order to be able to group samples robustly, one needs the diffusion time to be not too small; otherwise, the sample relatedness may be highly localized and decay too fast away from each sample. Moreover, the leading order term in the asymptotic series is an increasing function of and diverges as approaches , yielding an incorrect conclusion that two antipodal points are highly similar. For these reasons, the machine learning community has been using only the Euclidean diffusion term without the asymptotic series correction; how this resulting kernel, called the parametrix kernel , compares with the exact heat kernel on a hypersphere remains an outstanding question, which is addressed in this paper.
Analytically solving the diffusion equation on a Riemannian manifold is challenging [13, 14, 12]. Unlike the discrete analogues – such as spectral clustering  and diffusion map , where eigenvectors of a finite dimensional matrix can be easily obtained – the eigenfunctions of the Laplace operator on a Riemannian manifold are usually intractable. Fortunately, the high degree of symmetry of a hypersphere allows the explicit construction of eigenfunctions, called hyperspherical harmonics, via the projection of homogeneous polynomials [17, 18]. The exact heat kernel is then obtained as a convergent power series in these eigenfunctions. In this paper, we compare the analytic behavior of this exact heat kernel with that of the parametrix kernel and analyze their performance in classification.
The heat kernel is the fundamental solution to the heat equation with an initial point source , where is the Laplace operator; the amount of heat emanating from the source that has diffused to a neighborhood during time is used to measure the similarity between the source and proximal points. The heat conduction depends on the geometry of feature space, and the main idea behind the application of hyperspherical geometry to data analysis relies on the following map from a non-negative feature space to a unit hypersphere:
A hyperspherical map maps a vector , with and , to a unit vector where .
We will henceforth denote the image of a feature vector under the hyperspherical map as . The notion of neighborhood requires a well-defined measurement of distance on the hypersphere, which is naturally the great arc length – the geodesic on a hypersphere. Both parametrix approximation and exact solution employ the great arc length, which is related to the following definition of cosine similarity:
The generic cosine similarity between two feature vectors is
where is the Euclidean -norm, and is the great arc length on . For unit vectors and obtained from non-negative feature vectors via the hyperspherical map, the cosine similarity reduces to the dot product ; the non-negativity of and guarantees that in this case.
2.1 Parametrix expansion
The parametrix kernel previously used in the literature is just a Gaussian RBF function with as the radial distance :
The parametrix kernel is a non-negative function
defined for and attaining global maximum at .
Note that this kernel is assumed to be restricted to the positive orthant. The normalization factor is numerically unstable as and complicates hyperparameter tuning; as a global scaling factor of the kernel can be absorbed into the misclassification -parameter in SVM, this overall normalization term is ignored in this paper. Importantly, the parametrix kernel is merely the Gaussian multiplicative factor without any asymptotic expansion terms in the full parametrix expansion of the heat kernel on a hypersphere [1, 12], as described below.
The Laplace operator on manifold equiped with a Riemannian metric acts on a function that depends only on the geodesic distance from a fixed point as
where and denotes the radial derivative. Due to the nonvanishing metric derivative in Equation 1, the canonical diffusion function
does not satisfy the heat equation; that is, (Supplementary Material, Section S2). For sufficiently small time and geodesic distance , the parametrix expansion of the heat kernel on a full hypersphere proposes an approximate solution
where the functions should be found such that satisfies the heat equation to order , which is small for and ; more precisely, we seek such that
Taking the time derivative of yields
while the Laplacian of is
One can easily compute
The left-hand side of Equation 3 is thus equal to multiplied by
and we need to solve for such that all the coefficients of in this expression, for , vanish.
For , we need to solve
Integrating with respect to yields
where we implicitly take only the radial part of . Thus, we get
as the zeroth-order term in the parametrix expansion. Using this expression of , the remaining terms become
and we obtain the recursion relation
Algebraic manipulations show that
from which we get
Integrating this equation and rearranging terms, we finally get
Setting in this recursion equation, we find the second correction term to be
From our previously obtained solution for , we find
Substituting these expressions into the recursion relation for yields
For the hypersphere , where and , we have
Notice that when and when . For , is an increasing function in and diverges to at . By contrast, for , is a decreasing function in and diverges to at ; is relatively constant for and starts to decrease rapidly only near . Therefore, the first order correction is not able to remove the unphysical behavior near in high dimensions where, according to the first order parametrix kernel, the surrounding area is hotter than the heat source.
Next, we apply Equation 4 again to obtain as
After some cumbersome algebraic manipulations, we find
Again, and are special dimensions, where for , and for ; for other dimensions, is singular at both and . Note that on , the metric in geodesic polar coordinate is , so all parametrix expansion coefficients must vanish identically, as we have explicitly shown above.
Thus, the full defined on a hypersphere, where the geodesic distance is just the arc length , suffers from numerous problems. The zeroth order correction term diverges at ; this behavior is not a major problem if is restricted to the range . Moreover, is also unphysical as when ; this condition on dimension and time is obtained by expanding and , and noting that the leading order term in the product of the two factors is a non-decreasing function of distance when , corresponding to the unphysical situation of nearby points being hotter than the heat source itself. As the feature dimension is typically very large, the restriction implies that we need to take the diffusion time to be very small, thus making the similarity measure captured by decay too fast away from each data point for use in clustering applications. In this work, we further computed the first and second order correction terms, denoted and in Equation 5 and Equation 6, respectively.In high dimensions, the divergence of and at is not a major problem, as we expect the expansion to be valid only in the vicinity ; however, the divergence of at (to in high dimensions) is pathological, and thus, we truncate our approximation to . Since is not able to correct the unphysical behavior of the parametrix kernel near in high dimensions, we conclude that the parametrix approximation fails in high dimensions. Hence, the only remaining part of still applicable to SVM classification is the Gaussian factor, which is clearly not a heat kernel on the hypersphere. The failure of this perturbative expansion using the Euclidean heat kernel as a starting point suggests that diffusion in and are fundamentally different and that the exact hyperspherical heat kernel derived from a non-perturbative approach will likely yield better insights into the diffusion process.
2.2 Exact hyperspherical heat kernel
By definition, the exact heat kernel is the fundamental solution to heat equation where is the hyperspherical Laplacian [19, 20, 13, 14]. In the language of operator theory, is an integral kernel, or Green’s function, for the operator and has an associated eigenfunction expansion. Because and share the same eigenfunctions, obtaining the eigenfunction expansion of amounts to solving for the complete basis of eigenfunctions of . The spectral decomposition of the Laplacian is in turn facilitated by embedding in and utilizing the global rotational symmetry of in . The Euclidean space harmonic functions, which are the solutions to the Laplace equation in , can be projected to the unit hypersphere through the usual separation of radial and angular variables [17, 18]. In this formalism, the hyperspherical Laplacian on naturally arises as the angular part of the Euclidean Laplacian on , and can be interpreted as the squared angular momentum operator in .
The resulting eigenfunctions of are known as the hyperspherical harmonics and generalize the usual spherical harmonics in to higher dimensions. Each hyperspherical harmonic is equipped with a triplet of parameters or “quantum numbers” : the degree , magnetic quantum numbers and . In the eigenfunction expansion of , we use the addition theorem of hyperspherical harmonics to sum over the magnetic quantum number and obtain the following main result:
The exact hyperspherical heat kernel can be expanded as a uniformly and absolutely convergent power series
in the interval and for , where are the Gegenbauer polynomials and is the surface area of . Since the kernel depends on and only through , we will write .
Proof. We will obtain an eigenfunction expansion of the exact heat kernel by using the lemmas proved in Supplementary Material Section S2.5.3. The completeness of hyperspherical harmonics (Lemma 1) states that
Applying the addition theorem (Lemma 2) to Equation 7, we get
Next, we apply time evolution operator on this initial state to generate the heat kernel
To show that it is a uniformly and absolutely convergent series for , note that
The terms involving Gegenbauer polynomials can be bounded by using Lemma 3 as
We thus have
But, in the large limit, the asymptotic expansion
for any . The sequence is thus convergent, and hence, the Weiestrass M-test implies that the eigenfunction expansion of the heat kernel is uniformly and absolutely convergent in the indicated intervals. Q.E.D.
Note that the exact kernel is a Mercer kernel re-expressed by summing over the degenerate eigenstates indexed by . As before, we will rescale the kernel by self-similarity and define:
The exact kernel is the exact heat kernel normalized by self-similarity:
which is defined for , is non-negative, and attains global maximum at .
Note that unlike , explicitly depends on the feature dimension . In general, SVM kernel hyperparameter tuning can be computationally costly for a data set with both high feature dimension and large sample size. In particular, choosing an appropriate diffusion time scale is an important challenge. On the one hand, choosing a very large value of will make the series converge rapidly; but, then, all points will become uniformly similar, and the kernel will not be very useful. On the other hand, a too small value of will make most data pairs too dissimilar, again limiting the applicability of the kernel. In practice, we thus need a guideline for a finite time scale at which the degree of “self-relatedness” is not singular, but still larger than the “relatedness” averaged over the whole hypersphere. Examining the asymptotic behavior of the exact heat kernel in high feature dimension shows that an appropriate time scale is ; in this regime the numerical sum in Theorem 1 satisfies a stopping condition at low orders in and the sample points are in moderate diffusion proximity to each other so that they can be accurately classified (Supplementary Material, Section S2.5.4).
Figure 1A illustrates the diffusion process captured by our exact kernel in three feature dimensions at time , for . In Figure 1B, we systematically compared the behavior of (1) dimension-independent parametrix kernel at time and (2) exact kernel on at for and . By symmetry, the slope of vanished at the south pole for any time and dimension . In sharp contrast, had a negative slope at , again highlighting a singular behavior of the parametrix kernel. The “relatedness” measured by at the sweet spot was finite over the whole hypersphere with sufficient contrast between nearby and far away points. Moreover, the characteristic behavior of at did not change significantly for different values of the feature dimension , confirming that the optimal for many classification applications will likely reside near the “sweet spot” .
2.3 SVM classifications
Linear SVM seeks a separating hyperplane that maximizes the margin, i.e. the distance to the nearest data point. The primal formulation of SVM attempts to minimize the norm of the weight vector that is normal to the separating hyperplane, subject to either hard or soft margin constraints. In the so-called Lagrange dual formulation of SVM, one applies the Representer Theorem to rewrite the weight as a linear combination of data points; in this set-up, the dot products of data points naturally appear, and kernel SVM replaces the dot product operation with a chosen kernel evaluation. The ultimate hope is that the data points will become linearly separable in the new feature space implicitly defined by the kernel.
We evaluated the performance of kernel SVM using the
linear kernel ,
Gaussian RBF ,
cosine kernel ,
parametrix kernel , and
exact kernel ,
on two independent data sets: (1) WebKB data of websites from four universities (WebKB-4-University) , and (2) glioblastoma multiforme (GBM) mutation data from The Cancer Genome Atlas (TCGA) with 5-fold cross-validations (CV) (Supplementary Material, Section S1). The WebKB-4-University data contained 4199 documents in total comprising four classes: student (1641), faculty (1124), course (930), and project (504); in our analysis, however, we selected an equal number of representative samples from each class, so that the training and testing sets had balanced classes. Table 1 shows the average optimal prediction accuracy scores of the five kernels for a varying number of representative samples, using 393 most frequent word features (Supplementary Material, Section S1). The exact kernel outperformed the Gaussian RBF and parametrix kernel, reducing the error by and by , respectively. Changing the feature dimension did not affect the performance much (Table 2).
In the TCGA-GBM data, there were 497 samples, and we aimed to impute the mutation status of one gene – i.e., mutant or wild-type – from the mutation counts of other genes. For each imputation target, we first counted the number of mutant samples and then selected an equal number of wild-type samples for 5-fold CV. Imputation tests were performed for top 102 imputable genes (Supplementary Material, Section S1). Table 3 shows the average prediction accuracy scores for 5 biologically interesting genes known to be important for cancer :
For the remaining genes, the exact kernel generally outperformed the linear, cosine and parametrix kernels (Figure 2). However, even though the exact kernel dramatically outperformed the Gaussian RBF in the WebKB-4-University classification problem, the advantage of the exact kernel in this mutation analysis was not evident (Figure 2). It is possible that the radial degree of freedom in this case, corresponding to the genome-wide mutation load in each sample, contained important covariate information not captured by the hyperspherical heat kernel. The difference in accuracy between the hyperspherical kernels (cos, prx, and ext) and the Euclidean kernels (lin and rbf) also hinted some weak dependence on class size (Figure 2), or equivalently the sample size . In fact, the level of accuracy showed much stronger correlation with the “effective sample size” related to the empirical Vapnik-Chervonenkis (VC) dimension [36, 4, 7, 37, 38] of a kernel SVM classifier (Figure 3A-E); moreover, the advantage of the exact kernel over the Guassian RBF kernel grew with the effective sample size ratio (Figure 3F, Supplementary Material, Section S2.5.5).
By construction, our definition of the hyperspherical map exploits only the positive portion of the whole hypersphere, where the parametrix and exact heat kernels seem to have similar performances. However, if we allow the data set to assume negative values, i.e. the feature space is the usual instead of , then we may apply the usual projective map, where each vector in the Euclidean space is normalized by its -norm. As shown in Figure 1B, the parametrix kernel is singular at and qualitatively deviates from the exact kernel for large values of . Thus, when data points populate the whole hypersphere, we expect to find more significant differences in performance between the exact and parametrix kernels. For example, Table 4 shows the kernel SVM classifications of 91 S&P500 Financials stocks against 64 Information Technology stocks () using their log-return instances between January 5, 2015 and November 18, 2016 as features. As long as the number of features was greater than sample size, , the exact kernel outperformed all other kernels and reduced the error of Gaussian RBF by and that of parametrix kernel by .
This paper has constructed the exact hyperspherical heat kernel using the complete basis of high-dimensional angular momentum eigenfunctions and tested its performance in kernel SVM. We have shown that the exact kernel and cosine kernel, both of which employ the hyperspherical maps, often outperform the Gaussian RBF and linear kernels. The advantage of using hyperspherical kernels likely arises from the hyperspherical maps of feature space, and the exact kernel may further improve the decision boundary flexibility of the raw cosine kernel. To be specific, the hyperspherical maps remove the less informative radial degree of freedom in a nonlinear fashion and compactify the Euclidean feature space into a unit hypersphere where all data points may then be enclosed within a finite radius. By contrast, our numerical estimations using TCGA-GBM data show that for linear kernel SVM, the margin tends to be much smaller than the data range in order to accommodate the separation of strongly mixed data points of different class labels; as a result, the ratio was much larger than that for cosine kernel SVM. This insight may be summarized by the fact that the upper bound on the empirical VC-dimension of linear kernel SVM tends to be much larger than that for cosine kernel SVM, especially in high dimensions, suggesting that the cosine kernel SVM is less sensitive to noise and more generalizable to unseen data. The exact kernel is equipped with an additional tunable hyperparameter, namely the diffusion time , which adjusts the curvature of nonlinear decision boundary and thus adds to the advantage of hyperspherical maps. Moreover, the hyperspherical kernels often have larger effective sample sizes than their Euclidean counterparts and, thus, may be especially useful for analyzing data with a small sample size in high feature dimensions.
The failure of the parametrix expansion of heat kernel, especially in dimensions , signals a dramatic difference between diffusion in a non-compact space and that on a compact manifold. It remains to be examined how these differences in diffusion process, random walk and topology between non-compact Euclidean spaces and compact manifolds like a hypersphere help improve clustering performance as supported by the results of this paper.
This research was supported by a Distinguished Scientist Award from Sontag Foundation and the Grainger Engineering Breakthroughs Initiative.
We thank Alex Finnegan and Hu Jin for critical reading of the manuscript and helpful comments. We also thank Mohith Manjunath for his help with the TCGA data.
S1 Data preparation and SVM classification
The WebKB-4-University raw webpage data were downloaded from http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/ and processed with the python packages Beautiful Soup and Natural Language Toolkit (NLTK). Our feature extraction excluded punctuation marks and included only letters and numerals where capital letters were all converted to lower case and each individual digit 0-9 was represented by a “#.” Very infrequent words, such as misspelled words, non-English words, and words mixed with special characters, were filtered out. We selected top most frequent words as features in our classification tests; the cutoff was chosen to select frequent words whose counts across all webpage documents are greater than of the total number of documents. There were 4199 documents in total: student (1641), faculty (1124), course (930), and project (504).
The TCGA-GBM data were downloaded from the GDC Data Portal under the name TCGA-GBM Aggregated Somatic Mutation. The mutation count data set was extracted from the MAF file, while ignoring the detailed types of mutations and counting only the total number of mutations in each gene. Very infrequently, mutated genes were filtered out if the total number of mutations in one gene across all samples is less than of the total number of samples ( samples and genes). We imputed the mutation status of one gene, mutant or wild-type, from the mutation counts of the remaining genes. The most imputable genes were selected using 5-fold cross-validation linear kernel SVM. Most of the mutant and wild-type samples were highly unbalanced, the ratio being typically around ; therefore, unthresholded area-under-the-curve (AUC) of the receiver operating characteristic (ROC) curve was used to quantify the classification performance of the linear kernel SVM. Mutated genes with AUC greater than were selected for the subsequent imputation tests.
To balance the sample size between classes, we performed K-means clustering of samples within each class, with a specified number of centroids and took the samples closest to each centroid as representatives. For the WebKB document classifications, we used , and K-means clustering was performed in each of the four classes separately; for the TCGA-GBM data, was chosen to be the number of samples in each mutant (minority) class, and K-means clustering was performed in the wild-type (majority) class. Since K-means might depend on the random initialization, we performed the clustering 50 times and selected the top most frequent representatives. Five-fold stratified cross-validations (CV) were performed on the resulting balanced data sets, where training and test samples were drawn without replacement from each class. The mean CV accuracy scores across the five folds were recorded.
S2 Hyperspherical Heat Kernel
s2.1 Laplacian on a Riemannian manifold
The Laplacian on a Riemannian manifold with metric is the operator
where , and the Einstein summation convention is used. It can be also written in terms of the covariant derivative as
The covariant derivative satisfies the following properties
where is the Levi-Civita connection satisfying and . To show Equation S2, recall that the Levi-Civita connection is uniquely determined by the geometry, or the metric tensor, as
Using the formula for determinant differentiation
we can thus write
Hence, for any ,
s2.2 The induced metric on
The -sphere embedded in can be parameterized as
where , for , and .
Let denote the Jacobian matrix for the above coordinate transformation. The square of the line element in is given by
Restricted to ,
Therefore, on , we have
Hence, the induced metric on embedded in is