Relative concentration bounds for the spectrum of kernel matrices

Abstract

In this paper we study the concentration properties of the eigenvalues of kernel matrices, which are central objects in a wide range of kernel methods and, more recently, in network analysis. We present a set of concentration inequalities tailored to each individual eigenvalue of the kernel matrix with respect to its known asymptotic limit. The inequalities presented here are of relative type, meaning that they scale with the eigenvalue under consideration, which results in convergence rates that vary across the spectrum. The rates we obtain are faster than the typical ones and are often exponential, depending on regularity assumptions of Sobolev type. One key feature of our results is that they apply to non-positive kernels, which is fundamental in the context of network analysis. We show how our results are well suited for the study of dot product kernels, which are related to random geometric graphs on the sphere, via the graphon formalism. We illustrate our results by applying them to a variety of dot product kernels on the sphere and to the one-dimensional Gaussian kernel.

MSC subject classifications: Primary 68Q32; secondary 60F99, 68T01.

Keywords: Kernel matrix, Graphon spectrum, Relative perturbation inequality, Random geometric graph, Erdős-Rényi graphon.

1 Introduction

Kernel methods have become an important tool in machine learning, with a wide range of applications including principal component analysis (PCA), clustering, non-parametric estimation and, more recently, the statistical analysis of networks. Many of these methods rely on the spectral decomposition of a data-dependent random matrix, commonly known as the kernel matrix, constructed by evaluating a symmetric kernel on a set of sample points. In PCA for instance [23, Sec. 5.1], the original data is reduced to a low-dimensional invariant subspace of a particular kernel matrix, based on the selection of a group of leading eigenvalues (those of largest value). An accurate estimation of the size of the eigenvalues of the kernel matrix is fundamental for obtaining theoretical guarantees on the error in this context.

The study of the spectra of kernel matrices, and of the related sample covariance matrices, has a long history. In a seminal paper, Marchenko and Pastur [36] described the limiting spectral distribution of sample covariance-type random matrices with centered independent entries of finite variance, in the regime where the ratio of dimension to sample size converges to a fixed strictly positive number. Given that the dimension goes to infinity, this line of work is often referred to as the high-dimensional setting (typically the ambient space is Euclidean and its dimension grows with the sample size). In the high-dimensional setting, El Karoui [17, 16] studied the asymptotic and finite-sample spectral properties of kernel matrices obtained by applying a function entrywise to such matrices, and proved that, under some regularity conditions on the kernel, the spectrum is essentially a linear deformation of the Marchenko-Pastur law. These results have been generalized, first in [11] and later in [15], where many regularity hypotheses have been removed.

In this paper we do not deal with the high-dimensional setting. Instead we place ourselves in the so-called low-dimensional context (or fixed-dimensional, given that the data dimension is fixed while the sample size grows), which has also attracted interest lately. In [31], the authors prove that under mild regularity (integrability) conditions, the spectrum of the kernel matrix, after proper normalization, converges to the spectrum of an infinite-dimensional object: the integral operator associated with this kernel. The convergence is stated in terms of what they call the $\delta_2$ metric, which is an $\ell_2$-type metric when the spectrum of the kernel matrix and the spectrum of the operator are regarded as elements of a sequence space (since we only deal with compact operators, this is possible). They also obtain a CLT describing the law of the fluctuations of the eigenvalues. Finite-sample results for the same metric have later been obtained in [13], where the authors study the problem of graphon estimation. Similar results have been obtained in the positive semidefinite case in [44].

In the language of matrix perturbation, this metric is an example of an absolute measure for the deviation of the eigenvalues, because it provides a uniform control over the spectrum. In relative bounds, the difference between eigenvalues of two matrices is weighted by a function of the eigenvalues themselves. For example, if $A$ and $B$ are two symmetric matrices whose eigenvalues are indexed decreasingly, typical relative measures for the deviation of the $j$-th eigenvalue normalize the difference $\lambda_j(A) - \lambda_j(B)$ by a function of the eigenvalues themselves (see the examples below). Such measures are often called of Weyl type, because they involve single eigenvalues. There also exist relative measures for a group of eigenvalues (including the full spectrum), which are often obtained by adding the single-eigenvalue bounds over a set of indices. Those bounds are often referred to as of Hoffman-Wielandt type, after the classic matrix eigenvalue inequality of that name. Relative inequalities are known to achieve better accuracy in both the deterministic [28] and the probabilistic [8] settings.
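For concreteness, two standard Weyl-type relative measures (written here in assumed notation; they may differ from the examples originally given by the authors) are

```latex
% Two standard relative (Weyl-type) measures of the deviation of the j-th eigenvalue,
% in assumed notation (A, B symmetric, eigenvalues indexed decreasingly):
\[
  \frac{|\lambda_j(A) - \lambda_j(B)|}{|\lambda_j(A)|}
  \qquad \text{and} \qquad
  \frac{|\lambda_j(A) - \lambda_j(B)|}{\sqrt{|\lambda_j(A)\,\lambda_j(B)|}} .
\]
```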

Our main results are relative concentration inequalities of Weyl type for kernels which are not necessarily positive, but which satisfy some regularity assumptions related to the pointwise convergence of the kernel's spectral expansion. In symbols, given a kernel we obtain inequalities in which the deviation between the $k$-th eigenvalue of the normalized kernel matrix and the $k$-th eigenvalue of the associated integral operator is controlled by a term that scales with the limit eigenvalue raised to a positive power, the eigenvalues being ordered decreasingly in absolute value. This type of relative bound has also been called a scaling bound [8], because the limit eigenvalue (which is an eigenvalue of the limiting integral operator) plays the role of a scaling term in the final error bound, and has an important effect on the estimated convergence rate. This scaling term will allow us to obtain rates that are better than parametric and often exponential (or almost exponential) in the sample size, similar to those obtained in [3] by different methods and under more restrictive hypotheses. Indeed, our results show that relative bounds are not only more accurate, but they also provide an alternative route to the better-than-parametric rates presented in [3]. Formally, our rates have three factors: a scaling term, a variance term and a concentration term. In the case of exponential rates, it is the scaling term that prevails. We show how the effect of the scaling term on the eigenvalue convergence rates becomes more explicit when considering the regularity hypotheses on the kernel, which are related to the eigenvalue decay and the growth of the eigenfunctions. We consider three types of regularity hypotheses which are common in the kernel literature and apply to a wide class of kernels.

Many of the relative bounds or scaling bounds in the literature are better adapted to the case where an index is fixed and describe the behavior of the deviation in terms of the sample size (as our first main result does). One of our contributions is to obtain inequalities for a varying index, which are especially useful when considering the smaller eigenvalues (in absolute value) or in the case of a fixed finite sample size. Our concentration bounds change depending on the index of the eigenvalue and they exhibit mixed tail regimes. Our approach can be divided into three steps, which we call the approximation, perturbation and concentration steps, depending on the main techniques used in each one. In the approximation step, we consider a finite-rank approximation of the kernel, and the kernel matrix is decomposed accordingly into two parts: the truncated part and the residual. In the perturbation step we use (deterministic) relative perturbation theory to bound the eigenvalue deviation. The obtained bound depends on two random terms (one is related to the residual of the approximation step) that we call noise terms. In the concentration step, we bound the noise terms using tail inequalities for $U$-statistics, such as those in [19].

We apply our results to the study of a widely used family of radially symmetric kernels, known as dot product kernels. We consider the Euclidean sphere as the ambient space, equipped with the geodesic distance. In this case the following remarkable property holds: the eigenfunctions do not depend on the kernel itself, but only on the dimension of the sphere. In other words, the basis is fixed for every kernel in this family and coincides with the basis of spherical harmonics. Their growth rates are known, which fixes one of the regularity parameters, and the regularity hypotheses then depend on the kernel eigenvalues only. This simplifies the main results and allows us to express them in terms of the dimension of the sphere and the eigenvalue decay rate. In this context, a bounded kernel can be identified with a graphon, which represents the limit of a sequence of dense graphs, following the theory of dense graph limits pioneered by Lovász and collaborators [34, 6, 5]. This makes relative concentration for single eigenvalues relevant in network analysis, especially given the recent development of methods and algorithms working with the spectrum in the contexts of testing [9], graphon estimation [13] and latent distance recovery in geometric graphs [2]. This type of kernel has also found applications in the context of deep learning [10].

Overview

The present paper is structured as follows: in Section 2 we introduce the basic material on integral operators and review the related literature. In Section 3 we state the main results, and in Section 3.1 we explain our three-step approach and state the main propositions used in the proof of the main results. In Section 4 we compare the obtained rates of convergence with results in the literature. In Section 5 we explain the connection between our regularity hypotheses for the kernel (decay-growth assumptions on the eigensystem) and the classical definition of Sobolev-type regularity. We explain in Section 6.1 the connection between kernels and random graphs via the graphon model. In Section 6, we adapt our results to the particular case of dot product kernels and give some examples in the case of geometric random graphs, obtaining explicit expressions for the deviation of the spectrum. All the proofs are in the Appendix.

Notation

We write $f \lesssim g$ to denote inequality up to constants, that is, for real functions, $f \lesssim g$ if and only if $f \le C g$ for some constant $C > 0$. Similarly, a subscript on the symbol, as in $\lesssim_{\alpha}$, stresses the fact that the constant depends on the parameter $\alpha$. We use the classic asymptotic notation $O(\cdot)$ and $o(\cdot)$ in a similar way. We use $\|\cdot\|$ to denote the operator norm of matrices and/or operators. Given that we deal only with self-adjoint compact operators, the operator norm is equal to the largest singular value.

2 Preliminaries

Given a probability space $(\mathcal{X}, \mathcal{F}, \mu)$, a kernel is defined as a symmetric measurable function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We will assume, here and thereafter, that $K$ is square integrable with respect to the product measure $\mu \otimes \mu$. Each kernel has an associated integral operator $T_K$ defined by the relation
$$ (T_K f)(x) = \int_{\mathcal{X}} K(x, y)\, f(y)\, d\mu(y), \qquad f \in L^2(\mu). $$

By a classical result of functional analysis, $T_K$ is a compact operator ([21, p. 216]) and, by the spectral theorem for compact operators, the spectrum of $T_K$ is a countable set with $0$ as its only accumulation point (in consequence, every eigenvalue of $T_K$, excluding $0$, has finite multiplicity). We index the sequence of eigenvalues in decreasing order with respect to their absolute values, that is $|\lambda_1| \ge |\lambda_2| \ge \cdots$. Thus for $k \ge 1$, $\lambda_k$ will denote the $k$-th largest eigenvalue in absolute value. On the other hand, we will use $\lambda(T_K)$ to denote the infinite ordered spectrum (with the decreasing ordering defined above). Thus, the sequence $\lambda(T_K)$ can be seen as an element of $c_0$, the space of sequences that converge to $0$. In many cases it will be useful to consider the spectrum as a sequence, rather than as an abstract set.

Another consequence of the spectral theorem is that the sequence of eigenfunctions $(\phi_k)_{k \ge 1}$ forms a Hilbertian basis of $L^2(\mu)$. Furthermore, we can decompose the kernel, in the $L^2$ sense, as follows

$$ K(x, y) = \sum_{k \ge 1} \lambda_k\, \phi_k(x)\, \phi_k(y), \qquad\qquad (1) $$

where $(\lambda_k, \phi_k)_{k \ge 1}$ is the eigensystem of $T_K$. We will say that a kernel has rank $R$ if the sequence of eigenvalues has exactly $R$ non-zero elements. If there are infinitely many non-zero eigenvalues, we say that the kernel has infinite rank. Stronger notions of convergence hold for (1) under additional assumptions on $K$. For instance, by the classical Mercer theorem, the convergence in (1) is uniform [39], provided that $K$ is continuous and positive.

Given i.i.d. $\mathcal{X}$-valued random variables $X_1, \dots, X_n$ with common law $\mu$, we define the matrix (with a slight abuse of notation)

$$ K := \big(K(X_i, X_j)\big)_{1 \le i, j \le n}, $$

which we call the kernel matrix (sometimes it is called the empirical matrix). We are interested in a normalized version of the kernel matrix, defined entrywise as $(K_n)_{ij} := \frac{1}{n} K(X_i, X_j)$, which can be regarded as the empirical version of $T_K$. Notice that the matrix $K_n$ is symmetric, hence it has real eigenvalues. We use the notation $\lambda(K_n)$ and $\lambda_k(K_n)$ for the spectrum of $K_n$ in a way analogous to the infinite-dimensional case ($\lambda(K_n)$ is a sequence and $\lambda_k(K_n)$ a real number), completing the sequence with zeros to obtain a bona fide element of $c_0$. Our main objective is to quantify how similar the spectra of $K_n$ and $T_K$ are, but for a single position and not for the whole sequence: for instance, how close $\lambda_k(K_n)$ is to $\lambda_k(T_K)$, which defines a pseudometric on the space of spectra.

Observe that the spectrum of $K_n$ depends on the random sample $X_1, \dots, X_n$, hence it is random, while the spectrum of the operator $T_K$ is deterministic and fixed (it does not depend on the sample).
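A minimal sketch of the objects just defined, assuming (purely for illustration) a Gaussian kernel on $[0,1]$ with the uniform measure: it builds the normalized kernel matrix with entries $K(X_i, X_j)/n$ and lists its eigenvalues in decreasing order of absolute value.

```python
# Minimal sketch (illustrative kernel and measure, not the paper's specific example).
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, y, bandwidth=0.2):
    """Illustrative kernel K(x, y); any symmetric square-integrable kernel works."""
    return np.exp(-(x - y) ** 2 / (2 * bandwidth ** 2))

n = 500
X = rng.uniform(0.0, 1.0, size=n)               # i.i.d. sample from mu = Unif[0, 1]
K_n = kernel(X[:, None], X[None, :]) / n        # normalized kernel matrix, entries K(X_i, X_j)/n

eigvals = np.linalg.eigvalsh(K_n)               # real eigenvalues (K_n is symmetric)
eigvals = eigvals[np.argsort(-np.abs(eigvals))] # decreasing order in absolute value
print(eigvals[:5])
```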

2.1 Related work

In [31] the authors study the asymptotic properties of $\lambda(K_n)$ using the $\delta_2$ metric, defined as follows. Given two sequences $x$ and $y$ in $\ell_2$,

$$ \delta_2(x, y) := \inf_{\pi \in \mathcal{P}} \Big( \sum_{i \ge 1} \big(x_i - y_{\pi(i)}\big)^2 \Big)^{1/2}, $$

where $\mathcal{P}$ is the set of permutations with finite support. The almost sure convergence $\delta_2\big(\lambda(K_n), \lambda(T_K)\big) \to 0$ as $n \to \infty$ is proven. They also derive a central limit theorem, giving the law of the fluctuations of $\lambda(K_n)$ around $\lambda(T_K)$. In [13] the authors obtained a concentration bound for this metric with respect to its mean and applied it to the non-parametric estimation of graphons. The authors prove that a non-parametric rate is achieved, which depends on the regularity (of Sobolev type) of the kernel under consideration. In [44], the positive definite case is considered and a parametric rate is obtained. Given that the single-eigenvalue deviation is bounded by the $\delta_2$ distance, those results already give an upper bound for the single-eigenvalue convergence rate and serve as benchmarks. However, this bound gives a uniform control on the eigenvalue deviations, which in many cases is too rough and, in consequence, we expect an overestimation of the deviation, especially for the smaller eigenvalues, as noted in [8].
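The following toy computation (with made-up numbers, not taken from the paper) illustrates why a uniform $\ell_2$-type control can overestimate the deviation of the smaller eigenvalues, which is the motivation for relative bounds.

```python
# Synthetic illustration: a uniform l2-type bound on the whole spectrum can be much
# coarser than the relative deviation of an individual small eigenvalue.
import numpy as np

true_spec = np.array([1.0, 0.5, 0.1, 0.01, 0.001])                # hypothetical operator spectrum
emp_spec = true_spec + np.array([3e-2, 2e-2, 5e-3, 5e-4, 5e-5])   # hypothetical empirical spectrum

# l2-type distance used as a uniform control (both sequences already sorted by |.|)
uniform_control = np.sqrt(np.sum((true_spec - emp_spec) ** 2))

# relative (Weyl-type) deviation of each eigenvalue
relative_dev = np.abs(true_spec - emp_spec) / np.abs(true_spec)

print(uniform_control)   # one number for the whole spectrum
print(relative_dev)      # scales with the eigenvalue under consideration
```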

In [8] the authors obtain relative concentration bounds for single eigenvalues. They call them scaling bounds because the error term scales with the eigenvalue under consideration, which frequently guarantees a more accurate estimation. Their results assume positive definiteness of the kernel, which, as we mentioned, might be restrictive for some applications, notably in the context of networks. The rates we obtain are not only an improvement over those in [8], but also hold in the more general framework of indefinite kernels. In particular, we prove that a certain bias term introduced by the truncation approach (which consists in approximating an infinite-rank kernel by a finite-rank one) does not have an impact on the rate for a large portion of the spectrum, and that we obtain at least a parametric rate (neither element is present in [8]).

In [3] the authors take an approximation-theoretic point of view, proving that for very regular (infinitely differentiable) radially symmetric kernels, the spectrum of the integral operator decays (almost) exponentially and, in addition, the eigenvalues of the kernel matrix are almost exponentially close to those of the operator. These results are in line with the ones we obtain in the case of exponential decay of the eigenvalues. The results in [3] are, however, obtained using Hilbert space methods, which in principle are only valid for positive kernels (which can be identified with an RKHS), and it is not clear how to extend them to the indefinite case.

There are many results dealing with the related subject of concentration for sample covariance matrices (or operators), such as [32, 42, 29]. In these works, relative concentration inequalities are proven for the eigenvalues of the empirical covariance matrix with respect to those of the population covariance matrix, with error terms involving the effective rank of the population covariance. The main difference with our results is that, in general, they consider random vectors with a fixed sub-Gaussian norm, which is not the case here. The fourth-moment hypothesis, considered for example in [42], is not well adapted to the context of dot product kernels (see Sect. 6). In addition, a key element of some of the results in this line is the use of RKHS technology, which does not extend directly to the indefinite case.

3 Main results

For our main result we introduce a summability hypothesis for the spectral expansion of the kernel given in (1). This guarantees the pointwise convergence of the expansion, which is necessary when evaluating the eigenfunctions at the sample points. We also introduce three more specialized hypotheses that are common in the kernel literature, based on the polynomial decay of the eigenvalues, on one side, and their exponential decay, on the other. These decay hypotheses are complemented with growth hypotheses on the eigenfunctions, which together imply the summability of the kernel spectral expansion. These hypotheses are satisfied by many widely used kernels.

More formally, we will assume that the kernel satisfies a summability condition on its eigensystem, which we call hypothesis H. In Lemma 22 we prove that H implies that the spectral expansion of the kernel converges almost surely. This is the main hypothesis for Theorem 1 below. Observe that under H the operator $T_K$ is trace-class, that is $\sum_{k \ge 1} |\lambda_k| < \infty$. Notice that this hypothesis is stronger than the bounded kernel assumption, but weaker than assuming a Mercer kernel, which is a common assumption in the kernel literature. Since our results hold with high probability, the uniform convergence given, for instance, by the Mercer theorem is not necessary. In Theorem 2 we will additionally assume a decay rate for the eigenvalues which is either polynomial (one hypothesis) or exponential (two further hypotheses). These two rate regimes are common assumptions in the theoretical analysis of kernel methods, see [8, 49, 42].

To quantify the possible growth of the eigenfunctions' norms, we assume a corresponding growth condition under each decay hypothesis. We do not consider the case of polynomial decay of the eigenvalues combined with exponential growth of the eigenfunctions, since in this case the kernel expansion series diverges, which is incompatible with H. Indeed, hypothesis H imposes restrictions on the possible combinations of decay and growth parameters, which are summarized in Table 1 below. We will always assume an additional mild condition on the parameters, which is not necessary in our proofs, but allows us to keep consistency with the classical definition of regularity, as explained in Section 5.

Table 1: Hypotheses for the eigenvalue decay and the growth of the eigenfunctions
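For concreteness, decay-growth conditions of this kind are commonly formalized as follows; the symbols $\alpha$, $\beta$ and the compatibility constraint in the last line are illustrative assumptions and may differ from the exact hypotheses summarized in Table 1.

```latex
% Illustrative decay--growth conditions (assumed notation, not necessarily the paper's exact hypotheses):
\begin{align*}
  &\text{polynomial eigenvalue decay:}  && |\lambda_k| \lesssim k^{-\alpha}, \quad \alpha > 1, \\
  &\text{exponential eigenvalue decay:} && |\lambda_k| \lesssim e^{-\alpha k}, \\
  &\text{eigenfunction growth:}         && \|\phi_k\|_{\infty} \lesssim k^{\beta}, \quad \beta \ge 0.
\end{align*}
% A summability condition in the spirit of H is then
%   \sum_{k \ge 1} |\lambda_k| \, \|\phi_k\|_{\infty}^{2} < \infty,
% which, under polynomial decay and polynomial growth, holds whenever \alpha > 2\beta + 1.
```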

The following theorem works under hypothesis H. We define a quantity that will act as a variance proxy, in the sense of concentration inequalities [7], in what follows.

Theorem 1.

Let $K$ be a kernel whose eigensystem satisfies H. Fix $k \ge 1$ and define

Then there exists such that for and for we have

with probability larger than .

Remark 1.

From the fact that $T_K$ is trace-class, it follows that the relevant tail quantities converge to $0$ as the index tends to infinity (see the proof of Theorem 1 in Sec. A.7). Consequently, the set appearing in the definition in Theorem 1 is non-empty. In Table 2 we show bounds for this term under the polynomial and exponential decay assumptions for the eigenvalues.

The constant that appears in Theorem 1 depends on $k$ in general. This dependence will be made explicit in the proof (Appendix A.7). One of the consequences of Theorem 1 is that the eigenvalue deviation attains a parametric rate in terms of $n$, when $k$ is fixed and $n$ is large enough, while maintaining the scaling term. The latter will be fundamental to obtain faster rates under the more specialized assumptions. Stated this way, this result is close to the CLT proven in [31], which says that when the eigenvalues of $T_K$ are simple (multiplicity one), the rescaled empirical eigenvalues converge in law to a Gaussian limit described by the generalized Brownian bridge (a centered Gaussian process indexed by functions, whose covariance is the same as the one defined by the sampling measure). Note that Theorem 1 is close to this asymptotic result in this respect, except that the variance term involves not only the $k$-th eigenfunction, but all the eigenfunctions up to a certain index. This variance term is similar to those appearing in the (absolute) bounds of the positive kernel literature [45] and [4]. In the aforementioned results, the variance term is the radius of the smallest ball, in a Hilbert space, that contains the evaluations of the feature maps. Here it is the radius of the smallest ball, in the appropriate norm, that contains the evaluations of the eigenfunctions.

The following theorem works under the more specialized decay hypotheses.

Theorem 2.

Let $K$ be a kernel satisfying one of the specialized hypotheses above. Then with probability larger than we have a bound of the form

where depends on the respective hypothesis and is given by the following table

Assumption

Both theorems derive from the same set of results (this is explained in Section 3.1 below), but they are adapted to different situations. In Theorem 1 we have a fixed index and we allow the sample size to grow, obtaining a rate in terms of $n$. In Theorem 2, we assume a fixed sample size and allow the index to vary with $n$. Observe that the obtained rates change depending on the value of the index (relative to the sample size). This is a known phenomenon in concentration inequalities, where we often have a mix of different tail regimes (frequently Gaussian and exponential). In Theorem 1 this is less important, as the focus is on the case when $n$ is large while the index is fixed, and a single term prevails. Observe that Theorem 2 gives rates that are better than parametric under the polynomial decay hypothesis and exponential under the two exponential decay hypotheses. This is in line with some recent results such as [3, Thm. 2]. A more detailed comparison with previous results in the literature is postponed to Section 4. In the next section we explain the ideas behind the proof of the main results.

Table 2: Bounds under the polynomial and exponential decay assumptions for the eigenvalues

3.1 Three step approach: approximation, perturbation and concentration

The proofs of Theorems 1 and 2 are based on three steps: approximation, perturbation and concentration. In the approximation step we construct a finite-rank version of the kernel which allows a direct comparison with the kernel matrix. This gives a framework where the kernel matrix can be seen as a random perturbation of a simpler matrix. In the perturbation step, we use deterministic matrix perturbation theory to quantify the deviation between their eigenvalues. At the end of this step we obtain a scaling bound that depends on random error terms (which are written as the operator norms of random error matrices). Finally, in the concentration step we use concentration inequalities for $U$-statistics to control the error terms. Throughout this section, we will assume that H holds.

Approximation step: The approximation step builds upon a truncation approach, used for example in [31, 13, 8], where we fix $R \ge 1$ and decompose $K$ into two terms: the best rank-$R$ approximation of $K$, in the $L^2$ sense, and the residual term. More specifically, we define

$$ K_R(x, y) := \sum_{k=1}^{R} \lambda_k\, \phi_k(x)\, \phi_k(y). \qquad\qquad (1) $$

We call $R$ the truncation parameter. Thus the following decomposition holds

$$ \frac{1}{n} K(X_i, X_j) = \frac{1}{n} K_R(X_i, X_j) + \frac{1}{n} \sum_{k > R} \lambda_k\, \phi_k(X_i)\, \phi_k(X_j), \qquad 1 \le i, j \le n. $$

The previous equality holds in the almost sure sense, given H and Lemma 22. Here and thereafter we will work on the set of measure one where this equality holds true. We call the matrix given by the first term on the right-hand side the $R$-truncated kernel matrix, and denote it by $\hat{K}_R$. Define the residual matrix (the error of the approximation) as

$$ E_R := K_n - \hat{K}_R = \Big( \frac{1}{n} \sum_{k > R} \lambda_k\, \phi_k(X_i)\, \phi_k(X_j) \Big)_{1 \le i, j \le n}, $$

where the second equality is justified by assumption H, which implies the pointwise equality. The $R$-truncated kernel matrix can be written as a multiplicative perturbation of a diagonal matrix. More specifically, we have the factorization

$$ \hat{K}_R = \frac{1}{n}\, \Psi\, \Lambda_R\, \Psi^{\top}, $$

where $\Psi$ is the $n \times R$ matrix with columns $\big(\phi_k(X_1), \dots, \phi_k(X_n)\big)^{\top}$, for $k = 1, \dots, R$, and $\Lambda_R$ is the diagonal matrix with $\lambda_1, \dots, \lambda_R$ on the diagonal. Thus, the normalized kernel matrix can be written as an additive perturbation of the $R$-truncated matrix by the residual matrix

$$ K_n = \frac{1}{n}\, \Psi\, \Lambda_R\, \Psi^{\top} + E_R. \qquad\qquad (2) $$

From the previous equality it is already possible to obtain scaling bounds, as is done in [8]. The idea is that the first term has a structure that makes it compatible with standard tools of matrix concentration (after using a deterministic multiplicative perturbation theorem). The residual term has less structure, but its operator norm will be small in comparison with the $R$-truncated matrix, provided that $R$ is well chosen. The fact that neither the $R$-truncated matrix nor the residual is positive definite and invertible prevents the use of most of the relative Weyl-type bounds, see [28] for example. In [8] and [13] the classic (absolute) Weyl inequality is used. Intuitively speaking, the problem with that approach is that in absolute perturbation the effect of the residual term on the eigenvalues of the truncated matrix is treated as uniform over the spectrum. In light of the asymptotic results in [31], and given the orthogonality of the eigenfunctions, this should not be the case. To overcome this, we introduce another factorization instead, which allows us to better exploit the multiplicative perturbation framework. The factorization is as follows

(3)

where

the columns of the matrix $\Psi^{\perp}$ form an orthonormal basis of the orthogonal complement of the space spanned by the columns of $\Psi$. Assume for the moment that the columns of $\Psi$ are linearly independent. On that event, define the corresponding projection matrices; the remaining matrices in the factorization are specified by

Notice that the columns of $\Psi^{\perp}$ are orthonormal, which is not the case for the columns of $\Psi$. In words, we decompose the residual matrix according to its projections onto the space generated by the columns of $\Psi$ and onto its orthogonal complement.
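As a schematic illustration of the approximation step, the sketch below uses an assumed toy eigensystem ($\lambda_k = k^{-2}$, $\phi_k(x) = \sqrt{2}\cos(k\pi x)$ on $[0,1]$ with the uniform measure; this is not the paper's kernel) to form the $R$-truncated matrix $\frac{1}{n}\Psi\Lambda_R\Psi^{\top}$ and the residual $E_R$.

```python
# Schematic illustration of the truncation decomposition (assumed toy eigensystem).
import numpy as np

rng = np.random.default_rng(1)

K_MAX = 200                                    # series truncated far out, only for computation
lam = np.arange(1, K_MAX + 1) ** -2.0          # toy eigenvalues, polynomial decay
phi = lambda k, x: np.sqrt(2) * np.cos(k * np.pi * x)   # orthonormal basis of L^2([0, 1])

def kernel(x, y):
    """K(x, y) = sum_k lambda_k phi_k(x) phi_k(y), evaluated on 1-D arrays x and y."""
    ks = np.arange(1, K_MAX + 1)
    return np.einsum('k,ik,jk->ij', lam, phi(ks, x[:, None]), phi(ks, y[:, None]))

n, R = 400, 10                                 # sample size and truncation parameter
X = rng.uniform(0, 1, n)

K_n = kernel(X, X) / n                         # normalized kernel matrix
ks_R = np.arange(1, R + 1)
Psi = phi(ks_R, X[:, None])                    # n x R matrix of eigenfunction evaluations
Lam_R = np.diag(lam[:R])
K_trunc = Psi @ Lam_R @ Psi.T / n              # R-truncated kernel matrix
E_R = K_n - K_trunc                            # residual matrix

print(np.linalg.norm(E_R, 2))                  # operator norm of the residual: small for well-chosen R
```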

Perturbation step: Define . We first use Weyl’s perturbation theorem to obtain the following

(4)

Here we use a relative multiplicative perturbation theorem known as Ostrowski's inequality [25, Thm. 4.5.9, Cor. 4.5.11] (see [8, Cor. A.3] for the non-square case and Remark 2 for its applicability with an indexing other than the non-increasing order) to obtain, for all

(5)

where the last equality comes from the fact that $\Psi^{\perp}$ has orthonormal columns. Using (5) and (4) we obtain, for all

(6)

Because of the block structure of , we have that . The eigenvalues are deterministic, while is a random set. By the definition of , the following trivial bound holds

Concentration step: We use concentration inequalities to control, with high probability, the terms on the right-hand side of (6); these bounds will also serve to characterize the random set involved. As we already mentioned, inequalities for quantities of this form are well established. In particular, given that the relevant matrix can be written as a sum of independent random matrices, we can use the non-commutative Bernstein inequality. Using the matrix Bernstein inequality [47, Thm. 6.1] we obtain

Proposition 3.

With probability larger than we have

For the remaining terms, the standard tools of matrix concentration do not seem to apply as smoothly. For instance, when the kernel has infinite rank, the residual cannot be expressed as a finite sum of independent matrices. Some concentration inequalities, such as the matrix bounded differences inequality [38, Cor. 6.1] or others obtained by the matrix Stein method [37], can be applied in this case, but they demand strong conditions, such as an almost sure control of a matrix variance proxy in the semi-definite order. In addition, in this case they deliver suboptimal results. We opt for a rougher matrix norm inequality, using the Frobenius norm to control the operator norm, and we then use concentration inequalities for $U$-statistics, such as those in [19, Thm. 3.3] or [26, Thm. 3.4]. This can be seen as a rough application of the comparison method described in [48], which consists in finding an easier-to-bound random process that majorizes the operator norm. Finding a tight majorizing random process is, in general, a challenging task and no canonical way of doing it is known. The fact that we use a rougher bound for the matrix norm will be compensated by optimizing the choice of $R$, which helps reduce the impact of this inaccuracy.
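To give a feel for the two ingredients of the concentration step, the sketch below (same assumed toy eigenfunctions as in the previous sketch, not the paper's kernel) checks numerically that the Gram factor $\Psi^{\top}\Psi/n$ concentrates around the identity, which is what makes the truncated part amenable to matrix Bernstein, and recalls the crude operator-versus-Frobenius norm comparison used for the residual.

```python
# Numerical illustration under the assumed toy eigenfunctions phi_k(x) = sqrt(2) cos(k*pi*x).
import numpy as np

rng = np.random.default_rng(2)
R = 10
ks = np.arange(1, R + 1)
phi = lambda x: np.sqrt(2) * np.cos(ks * np.pi * x[:, None])   # n x R matrix of evaluations

for n in (200, 2000, 20000):
    X = rng.uniform(0, 1, n)
    Psi = phi(X)
    gram_dev = np.linalg.norm(Psi.T @ Psi / n - np.eye(R), 2)
    print(n, gram_dev)                                          # decays roughly like n**-0.5

# The rough norm comparison used for the residual: ||A||_op <= ||A||_F for any matrix A.
A = rng.standard_normal((300, 300))
A = (A + A.T) / 2
print(np.linalg.norm(A, 2), np.linalg.norm(A, 'fro'))
```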

We have the following proposition which gives a tail bound for the terms , for , and .

Proposition 4.

We have with probability larger than

(7)
(8)

where

Define the event

Define . The following lemma, which proves that the event holds with high probability, is a direct consequence of Proposition 3.

Lemma 5.

For we have

Lemma 6.

Let be the linear span of . It holds

Let be the event such that (7) holds. In the event we have

For and we have, combining Lemma 6, Lemma 5 and Prop. 4

(9)

with probability larger than . The following two propositions help us to control for a fixed .

Proposition 7.

Assume that is such that . Then with probability larger than we have, for

(10)
Proposition 8.

Fix . We have, with probability larger than for

The proof of Theorem 1 uses Proposition 7. Indeed, the quantities involved can be controlled in the relevant regimes of the truncation parameter. For a fixed $k$ and for $n$ large enough, we can always choose the truncation parameter so that both conditions are satisfied, and then Proposition 7 implies the bound in Theorem 1. For Theorem 2, on the other hand, we use either Prop. 7 or Prop. 8 depending on the relative position of the index with respect to the truncation level. The fact that we have explicit assumptions on the eigenvalues and eigenfunctions allows us to make the relation between the index, the truncation parameter and the sample size more precise (see Lemmas 17-18 in the Appendix).

Remark 2 (Ostrowski’s inequality for the decreasing absolute value ordering).

As we mentioned above, Ostrowski's inequality is formulated for the non-decreasing (or, equivalently, non-increasing) ordering. Nonetheless, it is still valid for the ordering that is decreasing in absolute value. Indeed, given an ordering of the eigenvalues (that is, a bijective indexing), we can reorder them in the non-increasing order by applying a transformation to each eigenvalue, then apply Ostrowski's inequality and finally undo the transformation. The key here is that this reordering process is applied to a finite matrix, because some orderings are not compatible with the full spectrum of the operator (the increasing ordering cannot be applied to the spectrum of an indefinite operator, given that $0$ is an accumulation point).

4 Asymptotic rate analysis and further related work

The rates obtained in Theorem 2 are expressed in terms of the index and the sample size, which is natural for a fixed index, or purely in terms of the sample size after substituting the relation between them. Under the polynomial decay hypothesis we obtain a parametric rate in terms of $n$. Indeed, we can decompose this rate into a scaling term and a variance term, plus a concentration term. This is in line with the CLT in [31], where the same scaling and concentration terms appear. In that sense, the scaling and concentration terms seem to be optimal, while there is still room for some improvement in the variance term. If we allow the index to vary with $n$, we obtain rates that are fully expressed in terms of $n$, in which case they are always faster than parametric for all three hypotheses.

In [33] and [42] formally similar relative bounds are proven, which are in line with the sample covariance concentration results in [32]. In those articles the authors obtain bounds for the difference between the empirical and population eigenvalues of covariance matrices, which scale with the population eigenvalue and involve the effective rank of the population covariance matrix. In the case of [42], a different variance term is introduced, which depends on a regularization step based on shifting up the eigenvalues of the population covariance. Those results are similar in form to Theorem 1, but the variance term differs. The term involved is introduced as a measure of the effective rank of the covariance matrix (or operator). Observe that it is the same for all indices and does not change with the index, which is a difference with our approach. This is partly explained by the different context their work is devoted to. Indeed, one of the main assumptions there is that the random vectors, of which the matrix is the population covariance, have a fixed sub-Gaussian norm which does not change with the ambient dimension. This is not the case in our context, as explained in Section 3.1 above, given that we do not have a fixed Euclidean space where random vectors are sampled. Otherwise stated, the random vectors we need to control do not live in a fixed space with a fixed sub-Gaussian norm: the truncation parameter has to be determined, and the sub-Gaussian norm of the columns of the truncated matrix will depend on it. In [42], heavy-tailed random vectors are considered, but the fourth-moment assumption is not well adapted to our context (many of the kernels described in Section 6 will not satisfy that condition, because of the norm growth of the spherical harmonics [20]). On the other hand, the works [33, 42, 29] are set in the case of a positive matrix. Extensions to the case of not necessarily positive operators seem feasible, but they are not directly developed.

The asymptotic rates obtained in [8], for positive operators, are slower than those given in Theorem 2 under all three hypotheses. They do not assume explicit growth rates, but formulate their results under the assumption of bounded eigenfunctions (which falls within our framework) and a bounded kernel function (which is implied by H). For example, in the case of polynomial decay of the eigenvalues and a bounded kernel, they obtain an error rate which is slower than ours in terms of the sample size. We observe that for a fixed index we obtain a better rate for the error in terms of $n$. Indeed, the rate in Theorem 2 under the polynomial decay hypothesis is faster than parametric, provided that the decay is fast enough to satisfy H (which is also assumed in [8]). A similar comparison can be established in the case of exponentially decaying eigenvalues. More importantly, we avoid the cumbersome bias terms present, for example, in [8, Thm. 3].

The rate obtained in [3, Thm. 2] is almost exponential, where the constant involved is positive and the kernel is a positive, radially symmetric, infinitely differentiable kernel defined on a subset of Euclidean space. Their result is obtained using approximation-theoretic methods and it is a measure-independent bound, from which the aforementioned rate can easily be deduced. This represents an intermediate regime between our hypotheses. Their rate does not seem to depend explicitly on the eigenfunctions' growth, however the fact that the kernel is highly regular and radially symmetric has an effect (this shares similarities with the case of dot product kernels presented in Section 6 below). At least formally, our results are aligned with those in [3], in the sense that when an exponential decay rate of the eigenvalues of the integral operator is observed, the deviation will have an exponential rate (the scaling term prevails over the concentration term). It is worth mentioning that the approximation-theoretic methods used in [3] rely heavily on RKHS technology and do not extend automatically to the non-positive case. Extensions of this approach to the indefinite case, relaxing the symmetry and high-regularity hypotheses, might be possible using, for example, the Krein space framework [41]. Such investigations are out of the scope of this paper and are left for future work.

5 Eigenvalue decay revisited

Whether a given kernel satisfies any of the hypotheses of Theorem 2 is not necessarily easy to verify. If a basis of eigenfunctions is known, as in the case of the spherically symmetric kernels treated in Section 6 below, the eigenvalues can be obtained by computing the integral of the product of the kernel with the eigenfunctions of the basis. There is no guarantee that a closed-form analytic solution exists in general, but in practice this procedure can be carried out numerically. On the other hand, when the eigenfunctions of the kernel are not known, we are left to solve often complicated differential equations. For that reason it is useful to have an equivalent notion of regularity at hand.

The decay of the eigenvalues appears naturally in connection with regularity hypotheses of Sobolev type. Indeed, given a measurable metric space equipped with a distance and a probability measure, we suppose that we have an orthonormal basis of the associated $L^2$ space, indexed by a countable set. We define the weighted Sobolev space with associated positive weights as

We can take, as in Section 6.1, a concrete measurable metric space with its distance and measure. If we have an orthonormal basis of the base $L^2$ space, then a basis of the product space can be built from it. Observe that, for a kernel with eigenvalues and eigenfunctions as in Section 2, the corresponding expansion coefficients are determined by the eigenvalues. If we want the series in the definition of the Sobolev space to converge, it is sufficient that the eigenvalues are dominated in terms of the weights. This allows us to control the decay behavior of the eigenvalues by direct comparison to the weights.

When the underlying space is an open subset of Euclidean space, the classical definition of weighted Sobolev spaces makes use of the weak derivatives of a function. For a locally integrable function, we define the weighted Sobolev space as the normed space of locally integrable functions with weak derivatives such that the following norm is finite

where the sum runs over multi-indices and the derivatives are understood in the weak sense.

For a symmetric kernel we can define Sobolev regularity through the canonical embedding into the corresponding function space, but it seems more natural (see [49, Sect. 2.2]) to say that the kernel satisfies the weighted Sobolev condition if the condition holds in each variable. However, in some cases, as for the dot product kernels, where there exists a real function of one variable representing the kernel, it is even more natural to say that the kernel satisfies the Sobolev condition with a given weight if this one-dimensional function does. Intuitively speaking, given that this function is defined on an interval, it seems natural to carry out the analysis in one dimension.

In [40] it is proved that, in the one-dimensional case, both definitions of weighted Sobolev spaces coincide. Otherwise stated, an equality between the two metric spaces holds for the appropriate choice of weights. Here we recognize the sequence of eigenvalues of the Laplace-Beltrami operator on the sphere, and the weight is the one that defines the orthogonality relations between the Gegenbauer polynomials. That means that two Gegenbauer polynomials of different degrees are orthogonal in the space of square-integrable functions on the interval with respect to this weight, and we denote accordingly the norm of this weighted space. In the next section, we explore this case in more detail and highlight the connection with random geometric graphs.
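As a quick numerical sanity check of the Gegenbauer orthogonality relations mentioned above (with an illustrative choice of parameter, not tied to a specific sphere dimension):

```python
# Numerical check of Gegenbauer orthogonality under the weight (1 - t^2)^(alpha - 1/2) on [-1, 1].
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_gegenbauer

alpha = 1.5   # illustrative value; the relevant alpha depends on the sphere dimension

def weighted_inner(n, m):
    integrand = lambda t: (eval_gegenbauer(n, alpha, t) * eval_gegenbauer(m, alpha, t)
                           * (1 - t ** 2) ** (alpha - 0.5))
    val, _ = quad(integrand, -1.0, 1.0)
    return val

print(weighted_inner(2, 5))   # ~ 0: different degrees are orthogonal
print(weighted_inner(3, 3))   # > 0: squared weighted norm of C_3^alpha
```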

6 Dot product kernels

In this section we will consider the Euclidean sphere, equipped with the geodesic distance and the surface (or uniform) measure normalized to be a probability measure. Let $f$ be a measurable real function defined on $[-1, 1]$. Note that the geodesic distance on the sphere is encoded by the inner product (it is the arccosine of the inner product). Thus we directly assume, here and thereafter, that the kernel only depends on the inner product, that is

$$ K(x, y) = f(\langle x, y \rangle). $$

This family of kernels is usually known as dot product kernels; they are rotation invariant, that is, $K(Qx, Qy) = K(x, y)$ for any rotation matrix $Q$, and their associated integral operators are convolution operators. This type of kernel has been used in the context of random geometric graphs [13] and deep learning [10], to name a few applications. Similarly to the context of Fourier analysis of one-dimensional periodic functions, in this case we have a fixed Hilbertian basis of eigenfunctions that only depends on the sphere, but not on the particular choice of kernel. The aforementioned basis is composed of the well-known spherical harmonics [12, chap. 1], which play the role of the Fourier basis in this case. For each order we have an associated eigenspace, known as the space of spherical harmonics of that order. Taking an orthonormal basis of each such space, we obtain, by [12, cor. 1.1.4],

(11)

for all points on the sphere. The second equality follows easily from the definition (see Appendix B.1). We define the eigenvalue of the integral operator associated with the space of spherical harmonics of each order. We use this subscript to distinguish this indexation (which follows the spherical harmonic order) from the decreasing-order indexation used before. As sets the two spectra are equal (they have the same elements), but in the present indexation the eigenvalues are counted without multiplicity (except if the same value is associated with more than one order). This seems more natural in this case, but one has to keep it in mind when applying Theorems 1 and 2. In this setting, the expansion (1) becomes

(12)

On the other hand, the addition theorem for spherical harmonics [12, eq. 1.2.8] gives

(13)

and the preceding equality does not depend on the particular choice of basis.
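To illustrate the fixed-basis phenomenon, the following sketch (with the hypothetical choice $f(t) = e^{t}$ on the two-dimensional sphere, purely for illustration) shows the eigenvalues of the normalized kernel matrix clustering in groups whose sizes match the spherical-harmonic multiplicities $2\ell + 1$.

```python
# Illustration (hypothetical dot product kernel f(t) = exp(t) on S^2 in R^3):
# the empirical eigenvalues cluster in groups of sizes 1, 3, 5, ... matching
# the spherical-harmonic multiplicities.
import numpy as np

rng = np.random.default_rng(3)

n, d = 2000, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform sample on the sphere S^{d-1}

f = np.exp                                       # dot product kernel K(x, y) = f(<x, y>)
K_n = f(X @ X.T) / n                             # normalized kernel matrix

eigvals = np.linalg.eigvalsh(K_n)
eigvals = eigvals[np.argsort(-np.abs(eigvals))]
print(eigvals[:9])   # roughly: one eigenvalue (l = 0), then a group of 3 (l = 1), then 5 (l = 2)
```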