Deep Equals Shallow for ReLU Networks in Kernel Regimes

# Deep Equals Shallow for ReLU Networks in Kernel Regimes

## Abstract

Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization, and this paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their “shallow” two-layer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere.

## 1 Introduction

The question of which functions can be well approximated by neural networks is crucial for understanding when these models are successful, and has always been at the heart of the theoretical study of neural networks (e.g., Hornik et al., 1989; Pinkus, 1999). While early works have mostly focused on shallow networks with only two layers, more recent works have shown benefits of deep networks for approximating certain classes of functions (Eldan and Shamir, 2016; Mhaskar and Poggio, 2016; Telgarsky, 2016; Daniely, 2017; Yarotsky, 2017; Schmidt-Hieber, 2020). Unfortunately, many of these approaches rely on constructions that are not currently known to be learnable using efficient algorithms.

A separate line of work has considered over-parameterized networks with random neurons (Neal, 1996), which also display universal approximation properties while additionally providing efficient algorithms based on kernel methods or their approximations such as random features (Rahimi and Recht, 2007; Bach, 2017b). Many recent results on gradient-based optimization of certain over-parameterized networks have been shown to be equivalent to kernel methods with an architecture-specific kernel called the neural tangent kernel (NTK) and thus also fall in this category (e.g., Jacot et al., 2018; Li and Liang, 2018; Allen-Zhu et al., 2019; Du et al., 2019a, b; Zou et al., 2019). This regime has been coined lazy (Chizat et al., 2019), as it does not capture the common phenomenon where weights move significantly away from random initialization and thus may not provide a satisfying model for learning adaptive representations, in contrast to other settings such as the mean field or active regime, which captures complex training dynamics where weights may move in a non-trivial manner and adapt to the data (e.g., Chizat and Bach, 2018; Mei et al., 2018). Nevertheless, one benefit compared to the mean field regime is that the kernel approach easily extends to deep architectures, leading to compositional kernels similar to the ones of Cho and Saul (2009); Daniely et al. (2016). Our goal in this paper is to study the role of depth in determining approximation properties for such kernels, with a focus on fully-connected deep ReLU networks.

Our approximation results rely on the study of eigenvalue decays of integral operators associated to the obtained dot-product kernels on the sphere, which are diagonalized in the basis of spherical harmonics. This provides a characterization of the functions in the corresponding reproducing kernel Hilbert space (RKHS) in terms of their smoothness, and leads to convergence rates for non-parametric regression when the data are uniformly distributed on the sphere. We show that for ReLU networks, the eigenvalue decays for the corresponding deep kernels remain the same regardless of the depth of the network. Our key result is that the decay for a certain class of kernels is characterized by a property related to differentiability of the kernel function around the point where the two inputs are aligned. In particular, the property is preserved when adding layers with ReLU activations, showing that depth plays essentially no role for such networks in kernel regimes. This highlights the limitations of the kernel regime for understanding the power of depth in fully-connected networks, and calls for new models of deep networks beyond kernels (see, e.g., Allen-Zhu and Li, 2020; Chen et al., 2020, for recent works in this direction). We also provide applications of our result to other kernels and architectures, and illustrate our results with numerical experiments on synthetic and real datasets.

#### Related work.

Kernels for deep learning were originally derived by Neal (1996) for shallow networks, and later for deep networks (Cho and Saul, 2009; Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018). Smola et al. (2001) study regularization properties of dot-product kernels on the sphere using spherical harmonics, and Bach (2017a) derives eigenvalue decays for such dot-product kernels arising from shallow networks with positively homogeneous activations including the ReLU. Extensions to shallow NTK or Laplace kernels are studied by Basri et al. (2019); Bietti and Mairal (2019b); Geifman et al. (2020); Ghorbani et al. (2019). The observation that depth does not change the decay of the NTK was previously made by Basri et al. (2020) empirically, and Geifman et al. (2020) provide a lower bound on the eigenvalues for deep networks; our work makes this observation rigorous by providing tight asymptotic decays. Azevedo and Menegatto (2014); Scetbon and Harchaoui (2020) also study eigenvalue decays for dot-product kernels but focus on kernels with geometric decays, while our main focus is on polynomial decays. We note that we recently became aware of the concurrent work (Chen and Xu, 2020) which obtains related results using a different analysis.

## 2 Review of Approximation with Dot-Product Kernels

In this section, we provide a brief review of the kernels that arise from neural networks and their approximation properties.

### 2.1 Kernels for wide neural networks

Wide neural networks with random weights or weights close to random initialization naturally lead to certain dot-product kernels that depend on the architecture and activation function, which we now present, with a focus on fully-connected architectures.

#### Random feature kernels.

We first consider a two-layer (shallow) network taking the form , for some activation function . When  are fixed and only  are trained with  regularization, this corresponds to using a random feature approximation Rahimi and Recht (2007) of the kernel

 k(x,x′)=Ew∼N(0,I)[σ(w⊤x)σ(w⊤x′)]. (1)

If  are on the sphere, then by spherical symmetry of the Gaussian distribution, one may show that  is invariant to unitary transformations and takes the form  for a certain function . More precisely, if  is the decomposition of  in the basis of Hermite polynomials , which are orthogonal w.r.t. the Gaussian measure, then we have (Daniely et al., 2016):

 κ(u)=∑i≥0a2iui. (2)

Conversely, given a kernel function of the form above with  with , one may construct corresponding activations using Hermite polynomials by taking

 σ(u)=∑iaihi(u),ai∈{±√bi}. (3)

In the case where  is -positively homogeneous, such as the ReLU  (with ), or more generally , then the kernel (1) takes the form  for any . This leads to RKHS functions of the form , with  defined on the sphere (Bietti and Mairal, 2019b, Prop. 8). In particular, for the step and ReLU activations  and , the functions  are given by the following arc-cosine kernels (Cho and Saul, 2009):4

 κ0(u)=1π(π−arccos(u)), κ1(u)=1π(u⋅(π−arccos(u))+√1−u2). (4)

Note that given a kernel function , the corresponding activations (3) will generally not be homogeneous, thus the inputs to a random network with such activations need to lie on the sphere (or be appropriately normalized) in order to yield the kernel .

#### Extension to deep networks.

When considering a deep network with more than two layers and fixed random weights before the last layer, the connection to random features is less direct since the features are correlated through intermediate layers. Nevertheless, when the hidden layers are wide enough, one still approaches a kernel obtained by letting the widths go to infinity (see, e.g., Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018), which takes a similar form to the multi-layer kernels of Cho and Saul (2009):

 kL(x,x′)=κL(x⊤x′):=κ∘⋯∘κL−1 times(x⊤x′),

for  on the sphere, where  is obtained as described above for a given activation , and  is the number of layers. It is usually good to normalize  such that , so that we also have , avoiding exploding or vanishing behavior for deep networks. In practice, this corresponds to using an activation-dependent scaling in the random initialization, which is commonly used by practitioners (He et al., 2015).

#### Neural tangent kernels.

When intermediate layers are trained along with the last layer using gradient methods, the resulting problem is non-convex and the statistical properties of such approaches are not well understood in general, particularly for deep networks. However, in a specific over-parameterized regime, it may be shown that gradient descent can reach a global minimum while keeping weights very close to random initialization. More precisely, for a network  parameterized by  with large width , the model remains close to its linearization around random initialization  throughout training, that is, . This is also known as the lazy training regime (Chizat et al., 2019). Learning is then equivalent to a kernel method with another architecture-specific kernel known as the neural tangent kernel (NTK, Jacot et al., 2018), given by

 kNTK(x,x′)=limm→∞⟨∇f(x;θ0),∇f(x′;θ0)⟩. (5)

For a simple two-layer network with activation , it is then given by

 kNTK(x,x′)=(x⊤x′) Ew[σ′(w⊤x)σ′(w⊤x′)]+Ew[σ(w⊤x)σ(w⊤x′)]. (6)

For a ReLU network with  layers with inputs on the sphere, taking appropriate limits on the widths, one can show (Jacot et al., 2018): , with  and for ,

 κℓ(u) =κ1(κℓ−1(u)) κℓNTK(u) =κℓ−1NTK(u)κ0(κℓ−1(u))+κℓ(u), (7)

where  and  are given in (4).

### 2.2 Approximation and harmonic analysis with dot-product kernels

In this section, we recall approximation properties of dot-product kernels on the sphere, through spectral decompositions of integral operators in the basis of spherical harmonics. Further background is provided in Appendix A.

#### Spherical harmonics and description of the RKHS.

A standard approach to study the RKHS of a kernel is through the spectral decomposition of an integral operator  given by  for some measure , leading to Mercer’s theorem (e.g., Cucker and Smale, 2002). When inputs lie on the sphere  in  dimensions, dot-product kernels of the form  are rotationally-invariant, depending only on the angle between  and . Similarly to how translation-invariant kernels are diagonalized in the Fourier basis, rotation-invariant kernels are diagonalized in the basis of spherical harmonics (Smola et al., 2001; Bach, 2017a), which lead to connections between eigenvalue decays and regularity as in the Fourier setting. In particular, if  denotes the uniform measure on , then , where  is the -th spherical harmonic polynomial of degree , where  plays the role of a frequency as in the Fourier case, and the number of such orthogonal polynomials of degree  is given by , which grows as  for large . The eigenvalues  only depend on the frequency  and are given by

 μk=ωd−2ωd−1∫1−1κ(t)Pk(t)(1−t2)(d−3)/2dt, (8)

where  is the Legendre polynomial of degree  in  dimensions (also known as Gegenbauer polynomial when using a different scaling), and  denotes the surface of the sphere . Mercer’s theorem then states that the RKHS  associated to the kernel is given by

 H=⎧⎨⎩f=∑k≥0,μk≠0N(d,k)∑j=1ak,jYk,j(⋅)~{}~{}~{}~{}s.t.~{}~{}~{}∥f∥2H:=∑k≥0,μk≠0N(d,k)∑j=1a2k,jμk<∞⎫⎬⎭. (9)

In particular, if  has a fast decay, then the coefficients  of  must also decay quickly with  in order for  to be in , which means  must have a certain level of regularity. Similarly to the Fourier case, an exponential decay of  implies that the functions in  are infinitely differentiable, while for polynomial decay  contains all functions whose derivatives only up to a certain order are bounded, as in Sobolev spaces. If two kernels lead to the same asymptotic decay of  up to a constant, then by (9) their RKHS norms are equivalent up to a constant, and thus they have the same RKHS. For the specific case of random feature kernels arising from -positively homogeneous activations, Bach (2017a) shows that  decays as  for  of the opposite parity of , and is zero for large enough  of opposite parity, which results in a RKHS that contains even or odd functions (depending on the parity of ) defined on the sphere with bounded derivatives up to order  (note that  must be greater than  in order for the eigenvalues of  to be summable and thus lead to a well-defined RKHS). Bietti and Mairal (2019b) show that the same decay holds for the NTK of two-layer ReLU networks, with  and a change of parity. Basri et al. (2019) show that the parity constraints may be removed by adding a zero-initialized additive bias term when deriving the NTK. We note that one can also obtain rates of approximation for Lipschitz functions from such decay estimates (Bach, 2017a). Our goal in this paper is to extend this to more general dot-product kernels such as those arising from multi-layer networks, by providing a more general approach for obtaining decay estimates from differentiability properties of the function .

#### Non-parametric regression.

When the data are uniformly distributed on the sphere, we may also obtain convergence rates for non-parametric regression, which typically depend on the eigenvalue decay of the integral operator associated to the marginal distribution on inputs and on the decomposition of the regression function  on the same basis (e.g., Caponnetto and De Vito, 2007). Then one may achieve optimal rates that depend mainly on the regularity of  when using various algorithms with tuned hyperparameters, but the choice of kernel and its decay may have an impact on the rates in some regimes, as well as on the difficulty of the optimization problem (see, e.g., Bach, 2013, Section 4.3).

## 3 Main Result and Applications to Deep Networks

In this section, we present our main results concerning approximation properties of dot-product kernels on the sphere, and applications to the kernels arising from wide random neural networks. We begin by stating our main theorem, which provides eigenvalue decays for dot-product kernels from differentiability properties of the kernel function  at the endpoints . We then present applications of this result to various kernels, including those coming from deep networks, showing in particular that the RKHSs associated to deep and shallow ReLU networks are the same (up to parity constraints).

### 3.1 Statement of our main theorem

We now state our main result regarding the asymptotic eigenvalue decay of dot-product kernels. Recall that we consider a kernel of the form  for , and seek to obtain decay estimates on the eigenvalues  defined in (8). We now state our main theorem, which derives the asymptotic decay of  with  in terms of differentiability properties of  around , assuming that  is infinitely differentiable on . This latter condition is always verified when  takes the form of a power series (2) with , since the radius of convergence is at least .

###### Theorem 1 (Decay from regularity of κ at endpoints, simplified).

Let  be a function that is  on  and has the following asymptotic expansions around :

 κ(1−t) =p1(t)+c1tν+o(tν) (10) κ(−1+t) =p−1(t)+c−1tν+o(tν), (11)

where  are polynomials and  is not an integer. Then there is an absolute constant  depending on  and  such that:

• For  even, if : ;

• For  odd, if : .

In the case , then we have  for one of the two parities (or both if ). If  is infinitely differentiable on  so that no such  exists, then  decays faster than any polynomial.

The full theorem is given in Appendix B along with its proof, and requires a mild technical condition on the expansion which is verified for all kernels considered in this paper, namely, a finite number of terms in the expansions with exponents between  and . The proof relies on integration by parts using properties of Legendre polynomials, in a way reminiscent of fast decays of Fourier series for differentiable functions, and on precise computations of the decay for simple functions of the form . This allows us to obtain the asymptotic decay for general kernel functions  as long as the behavior around the endpoints is known, in contrast to previous approaches which rely on the precise form of , or of the corresponding activation in the case of arc-cosine kernels (Bach, 2017a; Basri et al., 2019; Bietti and Mairal, 2019b; Geifman et al., 2020). This enables the study of more general and complex kernels, such as those arising from deep networks, as discussed below. We also note that when  is of the form , the exponent  in Theorem 1 is related to the decay of coefficients , as these must be consistent with the fact that  converges while  diverges, assuming .

### 3.2 Consequences for ReLU networks

When considering neural networks with ReLU activations, the corresponding random features and neural tangent kernels depend on the arc-cosine functions  and  defined in (4). These have the following expansions (with generalized exponents) near :

 κ0(1−t) =1−√2πt1/2+O(t3/2) (12) κ1(1−t) =1−t+2√23πt3/2+O(t5/2). (13)

Indeed, the first follows from integrating the expansion of the derivative using the relation and the second follows from the first using the expression of  in (4). Near , we have by symmetry , and we have by using  and . By Theorem 1, we immediately obtain a decay of  for even coefficients for , and  for odd coefficients for , recovering results of Bach (2017a). For the two-layer ReLU NTK, we have , leading to a similar expansion to  and thus decay, up to a change of parity due to the factor  which changes signs in the expansion around ; this recovers Bietti and Mairal (2019b). We note that for these specific kernels, Bach (2017a); Bietti and Mairal (2019b) show in addition that coefficients of the opposite parity are exactly zero for large enough , which imposes parity constraints on functions in the RKHS, although such a constraint may be removed in the NTK case by adding a zero-initialized bias term (Basri et al., 2019), leading to a kernel .

#### Deep networks.

Recall from Section 2.1 that the RF and NTK kernels for deep ReLU networks may be obtained through compositions and products using the functions  and . Since asymptotic expansions can be composed and multiplied, we can then obtain expansions for the deep RF and NTK kernels. The following results show that such kernels have the same eigenvalue decay as the ones for the corresponding shallow (two-layer) networks.

###### Corollary 2 (Deep RF decay.).

For the random neuron kernel  of an -layer ReLU network with , we have , where  is different depending on the parity of  and grows linearly with .

###### Corollary 3 (Deep NTK decay.).

For the neural tangent kernel  of an -layer ReLU network with , we have , where  is different depending on the parity of  and grows linearly with  (but is bounded when considering the normalized NTK , which satisfies ).

The proofs, given in Appendix C, use the fact that  and  have the same non-integer exponent factors in their expansions, and similarly for  and . One benefit compared to the shallow case is that the odd and even coefficients are both non-zero with the same decay, which removes the parity constraints, but as mentioned before, simple modifications of the shallow kernels can yield the same effect.

#### The finite neuron case.

For two-layer networks with a finite number of neurons, the obtained models correspond to random feature approximations of the limiting kernels (Rahimi and Recht, 2007). Then, one may approximate RKHS functions and achieve optimal rates in non-parametric regression as long as the number of random features exceeds a certain degrees-of-freedom quantity (Bach, 2017b; Rudi and Rosasco, 2017), which is similar to standard such quantities in the analysis of ridge regression (Caponnetto and De Vito, 2007), at least when the data are uniformly distributed on the sphere (otherwise the quantity involved may be larger unless features are sampled non-uniformly). Such a number of random features is optimal for a given eigenvalue decay of the integral operator (Bach, 2017b), which implies that the shallow random feature architectures provides optimal approximation for the multi-layer ReLU kernels as well, since the shallow and deep kernels have the same decay, up to the parity constraint. In order to overcome this constraint for shallow kernels while preserving decay, one may consider vector-valued random features of the form  with , leading to a kernel , where  is the random feature kernel corresponding to . With  has the same decay as , and when  it has the same decay as .

### 3.3 Extensions to other kernels

We now provide other examples of kernels for which Theorem 1 provides approximation properties thanks to its generality.

#### Laplace kernel and generalizations.

The Laplace kernel  has been found to provide similar empirical behavior to neural networks when fitting randomly labeled data with gradient descent (Belkin et al., 2018). Recently, Geifman et al. (2020) have shown that when inputs are on the sphere, the Laplace kernel has the same decay as the NTK, which may suggest a similar conditioning of the optimization problem as for fully-connected networks, as discussed in Section 2.2. Denoting  so that , we may easily recover this result using Theorem 1 by noticing that  is infinitely differentiable around  and satisfies

 κc(1−t)=e−c√t=1−c√t+O(t),

which yields the same decay  as the NTK. Geifman et al. (2020) also consider a heuristic generalization of the Laplace kernel with different exponents, . Theorem 1 allows us to obtain a precise decay for this kernel as well using , which is of the form  for non-integer , and in particular approaches the limiting order of smoothness  when .

#### Deep kernels with step activations.

We saw in Section 3.2 that for ReLU activations, depth does not change the decay of the corresponding kernels. In contrast, when considering step activations , we show in Appendix C.3 that approximation properties of the corresponding random neuron kernels (of the form ) improve with depth, leading to a decay  with for  layers. This also leads to an RKHS which becomes as large as allowed (order of smoothness close to ) when . While this may suggest a benefit of depth, note that step activations make optimization hard for anything beyond a linear regime with random weights, since the gradients with respect to inner neurons vanish.

#### Infinitely differentiable kernels.

Finally, we note that Theorem 1 shows that kernels associated to infinitely differentiable activations (which are themselves infinitely differentiable, see Daniely et al. (2016)), as well as Gaussian kernels on the sphere of the form , have decays faster than any polynomial. This results in a “small” RKHS that only contains smooth functions. See Azevedo and Menegatto (2014) for a more precise study of the decay for Gaussian kernels on the sphere.

## 4 Numerical experiments

We now present numerical experiments on synthetic and real data to illustrate our theory.

#### Synthetic experiments.

We consider randomly sampled inputs on the sphere  in 4 dimensions, and outputs generated according to the following target models, for an arbitrary : and . Note that  is discontinuous and thus not in the RKHS in general, while  is in the RKHS of  (since it is even and has the same decay as  as discussed in Section 3.3). In Figure 1 we compare the quality of approximation for different kernels by examining generalization performance of ridge regression with exact kernels or random features. The regularization parameter  is optimized on 10000 test datapoints on a logarithmic grid. In order to illustrate the difficulty of optimization due to a small optimal , which would also indicate slower convergence with gradient methods, we consider grids with , for two different choices of . We see that all kernels provide a similar rate of approximation for a large enough grid, but when fixing a smaller optimization budget by taking a larger , the NTK kernels can achieve better performance for large sample size . Figure 1(right) shows that when using  random features (which can achieve optimal rates in some settings, see Rudi and Rosasco, 2017), the “shallow” ReLU network performs better than a three-layer version, despite having fewer weights. This suggests that in addition to providing no improvements to approximation in the infinite-width case, the kernel regimes for deep ReLU networks may even be worse than their two-layer counterparts in the finite-width setting.

#### MNIST and Fashion-MNIST.

In Table 1, we consider the image classification datasets MNIST and Fashion-MNIST, which both consist of 60k training and 10k test images of size 28x28 with 10 output classes. We evaluate one-versus-all classifiers obtained by using kernel ridge regression by setting  for the correct label and  otherwise. We train on random subsets of 50k examples and use the remaining 10k examples for validation. We find that test accuracy is comparable for different numbers of layers in RF or NTK kernels, with a slightly poorer performance for the two-layer case likely due to parity constraints, in agreement with our theoretical result that the decay is the same for different . There is a small decrease in accuracy for growing , which may reflect changes in the decay constants or numerical errors when composing kernels. The slightly better performance of RF compared to NTK may suggest that these problems are relatively easy (e.g., the regression function is smooth), so that a faster decay is preferable due to better adaptivity to smoothness.

## 5 Discussion

In this paper, we have analyzed the approximation properties of deep networks in kernel regimes, by studying eigenvalue decays of integral operators through differentiability properties of the kernel function. In particular, the decay is governed by the form of the function’s (generalized) power series expansion around , which remains the same for kernels arising from fully-connected ReLU networks of varying depths. This result suggests that the kernel approach is unsatisfactory for understanding the power of depth in fully-connected networks. In particular, it highlights the need to incorporate other regimes in the study of deep networks, such as the mean field regime (Chizat and Bach, 2018; Mei et al., 2018), and other settings with hierarchical structure (see, e.g., Allen-Zhu and Li, 2020; Chen et al., 2020). We note that our results do not rule out benefits of depth for other network architectures in kernel regimes; for instance, depth may improve stability properties of convolutional kernels (Bietti and Mairal, 2019a, b), and a precise study of approximation for such kernels and its dependence on depth would also be of interest.

### Acknowledgments

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support of the European Research Council (grant SEQUOIA 724063).

## Appendix A Background on Spherical Harmonics

In this section, we provide some background on spherical harmonics needed for our study of approximation. See (Efthimiou and Frye, 2014; Atkinson and Han, 2012; Ismail, 2005) for references, as well as (Bach, 2017a, Appendix D). We consider inputs on the sphere .

We recall some properties of the spherical harmonics  introduced in Section 2.2. For , where , the spherical harmonics  are homogeneous harmonic polynomials of degree  that are orthonormal with respect to the uniform distribution  on the  sphere. The degree  plays the role of an integer frequency, as in Fourier series, and the collection  forms an orthonormal basis of . As with Fourier series, there are tight connections between decay of coefficients in this basis w.r.t. , and regularity/differentiability of functions, in this case differentiability on the sphere. This follows from the fact that spherical harmonics are eigenfunctions of the Laplace-Beltrami operator on the sphere  (see Efthimiou and Frye, 2014, Proposition 4.5):

 ΔSd−1Yk,j=−k(k+d−2)Yk,j. (14)

For a given frequency , we have the following addition formula:

 N(d,k)∑j=1Yk,j(x)Yk,j(y)=N(d,k)Pk(x⊤y), (15)

where  is the -th Legendre polynomial in dimension  (also known as Gegenbauer polynomial when using a different scaling), given by the Rodrigues formula:

 Pk(t)=(−1/2)kΓ(d−12)Γ(k+d−12)(1−t2)(3−d)/2(ddt)k(1−t2)k+(d−3)/2.

Note that these may also be expressed using the hypergeometric function  (see, e.g., Ismail, 2005, Section 4.5), an expression we will use in proof of Theorem 1 (see the proof of Lemma 6).

The polynomials  are orthogonal in  where the measure is given by the weight function , and we have

 ∫1−1P2k(t)(1−t2)(d−3)/2dt=ωd−1ωd−21N(d,k), (16)

where  denotes the surface of the sphere  in  dimensions. Using the addition formula (15) and orthogonality of spherical harmonics, we can show

 ∫Pj(w⊤x)Pk(w⊤y)dτ(w)=δjkN(d,k)Pk(x⊤y) (17)

We will use two other properties of Legendre polynomials, namely the following recurrence relation (Efthimiou and Frye, 2014, Eq. 4.36)

 tPk(t)=k2k+d−2Pk--1(t)+k+d−22k+d−2Pk+1(t), (18)

for , and for we simply have , as well as the differential equation  (see, e.g., Efthimiou and Frye, 2014, Proposition 4.20):

 (1−t2)P′′k(t)+(1−d)tP′k(t)+k(k+d−2)Pk(t)=0. (19)

The Funk-Hecke formula is helpful for computing Fourier coefficients in the basis of spherical harmonics in terms of Legendre polynomials: for any , we have

 ∫f(x⊤y)Yk,j(y)dτ(y)=ωd−2ωd−1Yk,j(x)∫1−1f(t)Pk(t)(1−t2)(d−3)/2dt. (20)

For example, we may use this to obtain decompositions of dot-product kernels by computing Fourier coefficients of functions . Indeed, denoting

 μk=ωd−2ωd−1∫1−1κ(t)Pk(t)(1−t2)(d−3)/2dt,

writing the decomposition of  using (20) leads to the following Mercer decomposition of the kernel:

 κ(x⊤y)=∞∑k=0μkN(d,k)∑j=1Yk,j(x)Yk,j(y)=∞∑k=0μkN(d,k)Pk(x⊤y). (21)

## Appendix B Proof of Theorem 1

The proof of Theorem 1, stated below in full as Theorem 7, proceeds as follows. We first derive an upper bound on the decay of  of the form  (Lemma 5), which is weaker than the desired , by exploiting regularity properties of  through integration by parts. The goal is then to apply this result on a function , where  is a function that allows us to “cancel” the leading terms in the expansions of , while being simple enough that it allows a precise estimate of its decay. In the proof of Theorem 7, we follow this strategy by considering  as a sum of functions of the form  and , for which we provide a precise computation of the decay in Lemma 6.

#### Decay upper bound through regularity.

We begin by establishing a weak upper bound on the decay of  (Lemma 5) by leveraging its regularity up to the terms of order . This is achieved by iteratively applying the following integration by parts lemma, which is conceptually similar to integrating by parts on the sphere by leveraging the spherical Laplacian relation (14) in Appendix A, but directly uses properties of  and of Legendre polynomials instead (namely, the differential equation (19)). We note that the final statement in Theorem 1 on infinitely differentiable  directly follows from Lemma 5.

###### Lemma 4 (Integration by parts lemma).

Let  be a function that is  on  and such that . We have

 ∫1−1κ(t)Pk(t)(1−t2)d−32dt =1k(k+d−2)(−κ(t)(1−t2)1+d−32P′k(t)∣∣1−1 (22) (23)

with .

###### Proof.

In order to perform integration by parts, we use the following differential equation satisfied by Legendre polynomials (see, e.g., Efthimiou and Frye, 2014, Proposition 4.20):

 (1−t2)P′′k(t)+(1−d)tP′k(t)+k(k+d−2)Pk(t)=0. (24)

Using this equation, we may write for ,

 ∫1−1κ(t)Pk(t)(1−t2)(d−3)/2dt =1k(k+d−2)((d−1)∫tκ(t)P′k(t)(1−t2)d−32dt (25) −∫κ(t)P′′k(t)(1−t2)1+d−32dt). (26)

We may integrate the second term by parts using

 ddt(κ(t)(1−t2)1+d−32) =κ′(t)(1−t2)1+d−32−2t(1+(d−3)/2)κ(t)(1−t2)d−32 =κ′(t)(1−t2)1+d−32−(d−1)tκ(t)(1−t2)d−32. (27)

Noting that the first term in (25) cancels out with the integral resulting from the second term in (27), we then obtain

 ∫1−1κ(t)Pk(t)(1−t2)(d−3)/2dt =1k(k+d−2)(−κ(t)(1−t2)1+d−32P′k(t)∣∣1−1 +∫1−1κ′(t)(1−t2)1+d−32P′k(t)dt).

Integrating by parts once more, the second term becomes

 ∫1−1κ′(t)(1−t2)1+d−32P′k(t)dt =κ′(t)(1−t2)1+d−32Pk(t)∣∣1−1 −∫1−1(κ′′(t)(1−t2)−(d−1)tκ′(t))Pk(t)(1−t2)(d−3)/2dt. (28)

The desired result follows. ∎

###### Lemma 5 (Weak upper bound on the decay).

Let  be a function that is  on  and has the following expansions around :

 κ(t) =p1(1−t)+O((1−t)ν) (29) κ(t) =p−1(1+t)+O((1+t)ν), (30)

where  are polynomials and  may be non-integer. Then we have

 μk(κ)=O(k−d−2ν+3). (31)
###### Proof.

Let  and for

 fj(t):=−f′′j−1(t)(1−t2)+(d−1)f′j−1(t). (32)

Then  is  on  and has similar expansions to  of the form

 fj(t) =pj,1(1−t)+O((1−t)ν−j) (33) fj(t) =pj,−1(1+t)+O((1+t)ν−j), (34)

for some polynomials . We may apply Lemma 4 repeatedly as long as the terms in brackets vanish, until we obtain, for ,

 μk(κ)=1(k(k+d−2))j+1(f′j(t)(1−t2)1+d−32Pk(t)∣∣1−1+∫1−1fj+1(t)Pk(t)(1−t2)(d−3)/2dt). (35)

Given our choice for , we have , and for some . Since  for any , we obtain . ∎

#### Precise decay for simple function.

We now provide precise decay estimates for functions of the form  and , which will lead to the dominant terms in the decomposition of  in the main theorem.

###### Lemma 6 (Decay for simple functions ϕν and ¯ϕν).

Let , with  non-integer, and let  denote its Legendre coefficients in  dimensions given by . We have

• if  is odd

• for  even, , with  a constant.

Analogously, let . We have

• if  is even

• for  odd, , with  a constant.

###### Proof.

We recall the following representation of Legendre polynomials based on the hypergeometric function (e.g., Ismail, 2005, Section 4.5):5

 Pk(t)=2F1(−k,k+d−2;(d−1)/2;(1−t)/2), (36)

where the hypergeometric function is given in its generalized form by

 pFq(a1,…,ap;b1,…,bq;x)=∞∑s=0(a1)s⋯(ap)s(b1)s⋯(bq)sxss!, (37)

where  is the rising factorial or Pochhammer symbol.

We then have

 ∫1−1(1−t2)ν+d−32Pk(t)dt =22ν+d−3∫1−1(1−t2)ν+d−32(1+t2)ν+d−32Pk(t)dt =22ν+d−3k∑s=0(−k)s(d−2+k)s(d−12)ss!∫1−1(1−t2)ν+d−32+s(1+t2)ν+d−32dt =22ν+d−2k∑s=0(−k)s(d−2+k)s(d−12)ss!∫10(1−x)ν+d−32+sxν+d−32dx