Deep Equals Shallow for ReLU Networks in Kernel Regimes
Abstract
Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain overparameterized regime, where the kernel is determined by the architecture and initialization, and this paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fullyconnected networks have essentially the same approximation properties as their “shallow” twolayer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere.
1 Introduction
The question of which functions can be well approximated by neural networks is crucial for understanding when these models are successful, and has always been at the heart of the theoretical study of neural networks (e.g., Hornik et al., 1989; Pinkus, 1999). While early works have mostly focused on shallow networks with only two layers, more recent works have shown benefits of deep networks for approximating certain classes of functions (Eldan and Shamir, 2016; Mhaskar and Poggio, 2016; Telgarsky, 2016; Daniely, 2017; Yarotsky, 2017; SchmidtHieber, 2020). Unfortunately, many of these approaches rely on constructions that are not currently known to be learnable using efficient algorithms.
A separate line of work has considered overparameterized networks with random neurons (Neal, 1996), which also display universal approximation properties while additionally providing efficient algorithms based on kernel methods or their approximations such as random features (Rahimi and Recht, 2007; Bach, 2017b). Many recent results on gradientbased optimization of certain overparameterized networks have been shown to be equivalent to kernel methods with an architecturespecific kernel called the neural tangent kernel (NTK) and thus also fall in this category (e.g., Jacot et al., 2018; Li and Liang, 2018; AllenZhu et al., 2019; Du et al., 2019a, b; Zou et al., 2019). This regime has been coined lazy (Chizat et al., 2019), as it does not capture the common phenomenon where weights move significantly away from random initialization and thus may not provide a satisfying model for learning adaptive representations, in contrast to other settings such as the mean field or active regime, which captures complex training dynamics where weights may move in a nontrivial manner and adapt to the data (e.g., Chizat and Bach, 2018; Mei et al., 2018). Nevertheless, one benefit compared to the mean field regime is that the kernel approach easily extends to deep architectures, leading to compositional kernels similar to the ones of Cho and Saul (2009); Daniely et al. (2016). Our goal in this paper is to study the role of depth in determining approximation properties for such kernels, with a focus on fullyconnected deep ReLU networks.
Our approximation results rely on the study of eigenvalue decays of integral operators associated to the obtained dotproduct kernels on the sphere, which are diagonalized in the basis of spherical harmonics. This provides a characterization of the functions in the corresponding reproducing kernel Hilbert space (RKHS) in terms of their smoothness, and leads to convergence rates for nonparametric regression when the data are uniformly distributed on the sphere. We show that for ReLU networks, the eigenvalue decays for the corresponding deep kernels remain the same regardless of the depth of the network. Our key result is that the decay for a certain class of kernels is characterized by a property related to differentiability of the kernel function around the point where the two inputs are aligned. In particular, the property is preserved when adding layers with ReLU activations, showing that depth plays essentially no role for such networks in kernel regimes. This highlights the limitations of the kernel regime for understanding the power of depth in fullyconnected networks, and calls for new models of deep networks beyond kernels (see, e.g., AllenZhu and Li, 2020; Chen et al., 2020, for recent works in this direction). We also provide applications of our result to other kernels and architectures, and illustrate our results with numerical experiments on synthetic and real datasets.
Related work.
Kernels for deep learning were originally derived by Neal (1996) for shallow networks, and later for deep networks (Cho and Saul, 2009; Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018). Smola et al. (2001) study regularization properties of dotproduct kernels on the sphere using spherical harmonics, and Bach (2017a) derives eigenvalue decays for such dotproduct kernels arising from shallow networks with positively homogeneous activations including the ReLU. Extensions to shallow NTK or Laplace kernels are studied by Basri et al. (2019); Bietti and Mairal (2019b); Geifman et al. (2020); Ghorbani et al. (2019). The observation that depth does not change the decay of the NTK was previously made by Basri et al. (2020) empirically, and Geifman et al. (2020) provide a lower bound on the eigenvalues for deep networks; our work makes this observation rigorous by providing tight asymptotic decays. Azevedo and Menegatto (2014); Scetbon and Harchaoui (2020) also study eigenvalue decays for dotproduct kernels but focus on kernels with geometric decays, while our main focus is on polynomial decays. We note that we recently became aware of the concurrent work (Chen and Xu, 2020) which obtains related results using a different analysis.
2 Review of Approximation with DotProduct Kernels
In this section, we provide a brief review of the kernels that arise from neural networks and their approximation properties.
2.1 Kernels for wide neural networks
Wide neural networks with random weights or weights close to random initialization naturally lead to certain dotproduct kernels that depend on the architecture and activation function, which we now present, with a focus on fullyconnected architectures.
Random feature kernels.
We first consider a twolayer (shallow) network taking the form , for some activation function . When are fixed and only are trained with regularization, this corresponds to using a random feature approximation Rahimi and Recht (2007) of the kernel
(1) 
If are on the sphere, then by spherical symmetry of the Gaussian distribution, one may show that is invariant to unitary transformations and takes the form for a certain function . More precisely, if is the decomposition of in the basis of Hermite polynomials , which are orthogonal w.r.t. the Gaussian measure, then we have (Daniely et al., 2016):
(2) 
Conversely, given a kernel function of the form above with with , one may construct corresponding activations using Hermite polynomials by taking
(3) 
In the case where is positively homogeneous, such as the ReLU (with ), or more generally , then the kernel (1) takes the form for any . This leads to RKHS functions of the form , with defined on the sphere (Bietti and Mairal, 2019b, Prop. 8).
In particular, for the step and ReLU activations and , the functions are given by the following arccosine kernels (Cho and Saul, 2009):
(4) 
Note that given a kernel function , the corresponding activations (3) will generally not be homogeneous, thus the inputs to a random network with such activations need to lie on the sphere (or be appropriately normalized) in order to yield the kernel .
Extension to deep networks.
When considering a deep network with more than two layers and fixed random weights before the last layer, the connection to random features is less direct since the features are correlated through intermediate layers. Nevertheless, when the hidden layers are wide enough, one still approaches a kernel obtained by letting the widths go to infinity (see, e.g., Daniely et al., 2016; Lee et al., 2018; Matthews et al., 2018), which takes a similar form to the multilayer kernels of Cho and Saul (2009):
for on the sphere, where is obtained as described above for a given activation , and is the number of layers. It is usually good to normalize such that , so that we also have , avoiding exploding or vanishing behavior for deep networks. In practice, this corresponds to using an activationdependent scaling in the random initialization, which is commonly used by practitioners (He et al., 2015).
Neural tangent kernels.
When intermediate layers are trained along with the last layer using gradient methods, the resulting problem is nonconvex and the statistical properties of such approaches are not well understood in general, particularly for deep networks. However, in a specific overparameterized regime, it may be shown that gradient descent can reach a global minimum while keeping weights very close to random initialization. More precisely, for a network parameterized by with large width , the model remains close to its linearization around random initialization throughout training, that is, . This is also known as the lazy training regime (Chizat et al., 2019). Learning is then equivalent to a kernel method with another architecturespecific kernel known as the neural tangent kernel (NTK, Jacot et al., 2018), given by
(5) 
For a simple twolayer network with activation , it is then given by
(6) 
For a ReLU network with layers with inputs on the sphere, taking appropriate limits on the widths, one can show (Jacot et al., 2018): , with and for ,
(7) 
where and are given in (4).
2.2 Approximation and harmonic analysis with dotproduct kernels
In this section, we recall approximation properties of dotproduct kernels on the sphere, through spectral decompositions of integral operators in the basis of spherical harmonics. Further background is provided in Appendix A.
Spherical harmonics and description of the RKHS.
A standard approach to study the RKHS of a kernel is through the spectral decomposition of an integral operator given by for some measure , leading to Mercer’s theorem (e.g., Cucker and Smale, 2002). When inputs lie on the sphere in dimensions, dotproduct kernels of the form are rotationallyinvariant, depending only on the angle between and . Similarly to how translationinvariant kernels are diagonalized in the Fourier basis, rotationinvariant kernels are diagonalized in the basis of spherical harmonics (Smola et al., 2001; Bach, 2017a), which lead to connections between eigenvalue decays and regularity as in the Fourier setting. In particular, if denotes the uniform measure on , then , where is the th spherical harmonic polynomial of degree , where plays the role of a frequency as in the Fourier case, and the number of such orthogonal polynomials of degree is given by , which grows as for large . The eigenvalues only depend on the frequency and are given by
(8) 
where is the Legendre polynomial of degree in dimensions (also known as Gegenbauer polynomial when using a different scaling), and denotes the surface of the sphere . Mercer’s theorem then states that the RKHS associated to the kernel is given by
(9) 
In particular, if has a fast decay, then the coefficients of must also decay quickly with in order for to be in , which means must have a certain level of regularity. Similarly to the Fourier case, an exponential decay of implies that the functions in are infinitely differentiable, while for polynomial decay contains all functions whose derivatives only up to a certain order are bounded, as in Sobolev spaces. If two kernels lead to the same asymptotic decay of up to a constant, then by (9) their RKHS norms are equivalent up to a constant, and thus they have the same RKHS. For the specific case of random feature kernels arising from positively homogeneous activations, Bach (2017a) shows that decays as for of the opposite parity of , and is zero for large enough of opposite parity, which results in a RKHS that contains even or odd functions (depending on the parity of ) defined on the sphere with bounded derivatives up to order (note that must be greater than in order for the eigenvalues of to be summable and thus lead to a welldefined RKHS). Bietti and Mairal (2019b) show that the same decay holds for the NTK of twolayer ReLU networks, with and a change of parity. Basri et al. (2019) show that the parity constraints may be removed by adding a zeroinitialized additive bias term when deriving the NTK. We note that one can also obtain rates of approximation for Lipschitz functions from such decay estimates (Bach, 2017a). Our goal in this paper is to extend this to more general dotproduct kernels such as those arising from multilayer networks, by providing a more general approach for obtaining decay estimates from differentiability properties of the function .
Nonparametric regression.
When the data are uniformly distributed on the sphere, we may also obtain convergence rates for nonparametric regression, which typically depend on the eigenvalue decay of the integral operator associated to the marginal distribution on inputs and on the decomposition of the regression function on the same basis (e.g., Caponnetto and De Vito, 2007). Then one may achieve optimal rates that depend mainly on the regularity of when using various algorithms with tuned hyperparameters, but the choice of kernel and its decay may have an impact on the rates in some regimes, as well as on the difficulty of the optimization problem (see, e.g., Bach, 2013, Section 4.3).
3 Main Result and Applications to Deep Networks
In this section, we present our main results concerning approximation properties of dotproduct kernels on the sphere, and applications to the kernels arising from wide random neural networks. We begin by stating our main theorem, which provides eigenvalue decays for dotproduct kernels from differentiability properties of the kernel function at the endpoints . We then present applications of this result to various kernels, including those coming from deep networks, showing in particular that the RKHSs associated to deep and shallow ReLU networks are the same (up to parity constraints).
3.1 Statement of our main theorem
We now state our main result regarding the asymptotic eigenvalue decay of dotproduct kernels. Recall that we consider a kernel of the form for , and seek to obtain decay estimates on the eigenvalues defined in (8). We now state our main theorem, which derives the asymptotic decay of with in terms of differentiability properties of around , assuming that is infinitely differentiable on . This latter condition is always verified when takes the form of a power series (2) with , since the radius of convergence is at least .
Theorem 1 (Decay from regularity of at endpoints, simplified).
Let be a function that is on and has the following asymptotic expansions around :
(10)  
(11) 
where are polynomials and is not an integer. Then there is an absolute constant depending on and such that:

For even, if : ;

For odd, if : .
In the case , then we have for one of the two parities (or both if ). If is infinitely differentiable on so that no such exists, then decays faster than any polynomial.
The full theorem is given in Appendix B along with its proof, and requires a mild technical condition on the expansion which is verified for all kernels considered in this paper, namely, a finite number of terms in the expansions with exponents between and . The proof relies on integration by parts using properties of Legendre polynomials, in a way reminiscent of fast decays of Fourier series for differentiable functions, and on precise computations of the decay for simple functions of the form . This allows us to obtain the asymptotic decay for general kernel functions as long as the behavior around the endpoints is known, in contrast to previous approaches which rely on the precise form of , or of the corresponding activation in the case of arccosine kernels (Bach, 2017a; Basri et al., 2019; Bietti and Mairal, 2019b; Geifman et al., 2020). This enables the study of more general and complex kernels, such as those arising from deep networks, as discussed below. We also note that when is of the form , the exponent in Theorem 1 is related to the decay of coefficients , as these must be consistent with the fact that converges while diverges, assuming .
3.2 Consequences for ReLU networks
When considering neural networks with ReLU activations, the corresponding random features and neural tangent kernels depend on the arccosine functions and defined in (4). These have the following expansions (with generalized exponents) near :
(12)  
(13) 
Indeed, the first follows from integrating the expansion of the derivative using the relation and the second follows from the first using the expression of in (4). Near , we have by symmetry , and we have by using and . By Theorem 1, we immediately obtain a decay of for even coefficients for , and for odd coefficients for , recovering results of Bach (2017a). For the twolayer ReLU NTK, we have , leading to a similar expansion to and thus decay, up to a change of parity due to the factor which changes signs in the expansion around ; this recovers Bietti and Mairal (2019b). We note that for these specific kernels, Bach (2017a); Bietti and Mairal (2019b) show in addition that coefficients of the opposite parity are exactly zero for large enough , which imposes parity constraints on functions in the RKHS, although such a constraint may be removed in the NTK case by adding a zeroinitialized bias term (Basri et al., 2019), leading to a kernel .
Deep networks.
Recall from Section 2.1 that the RF and NTK kernels for deep ReLU networks may be obtained through compositions and products using the functions and . Since asymptotic expansions can be composed and multiplied, we can then obtain expansions for the deep RF and NTK kernels. The following results show that such kernels have the same eigenvalue decay as the ones for the corresponding shallow (twolayer) networks.
Corollary 2 (Deep RF decay.).
For the random neuron kernel of an layer ReLU network with , we have , where is different depending on the parity of and grows linearly with .
Corollary 3 (Deep NTK decay.).
For the neural tangent kernel of an layer ReLU network with , we have , where is different depending on the parity of and grows linearly with (but is bounded when considering the normalized NTK , which satisfies ).
The proofs, given in Appendix C, use the fact that and have the same noninteger exponent factors in their expansions, and similarly for and . One benefit compared to the shallow case is that the odd and even coefficients are both nonzero with the same decay, which removes the parity constraints, but as mentioned before, simple modifications of the shallow kernels can yield the same effect.
The finite neuron case.
For twolayer networks with a finite number of neurons, the obtained models correspond to random feature approximations of the limiting kernels (Rahimi and Recht, 2007). Then, one may approximate RKHS functions and achieve optimal rates in nonparametric regression as long as the number of random features exceeds a certain degreesoffreedom quantity (Bach, 2017b; Rudi and Rosasco, 2017), which is similar to standard such quantities in the analysis of ridge regression (Caponnetto and De Vito, 2007), at least when the data are uniformly distributed on the sphere (otherwise the quantity involved may be larger unless features are sampled nonuniformly). Such a number of random features is optimal for a given eigenvalue decay of the integral operator (Bach, 2017b), which implies that the shallow random feature architectures provides optimal approximation for the multilayer ReLU kernels as well, since the shallow and deep kernels have the same decay, up to the parity constraint. In order to overcome this constraint for shallow kernels while preserving decay, one may consider vectorvalued random features of the form with , leading to a kernel , where is the random feature kernel corresponding to . With , has the same decay as , and when it has the same decay as .
3.3 Extensions to other kernels
We now provide other examples of kernels for which Theorem 1 provides approximation properties thanks to its generality.
Laplace kernel and generalizations.
The Laplace kernel has been found to provide similar empirical behavior to neural networks when fitting randomly labeled data with gradient descent (Belkin et al., 2018). Recently, Geifman et al. (2020) have shown that when inputs are on the sphere, the Laplace kernel has the same decay as the NTK, which may suggest a similar conditioning of the optimization problem as for fullyconnected networks, as discussed in Section 2.2. Denoting so that , we may easily recover this result using Theorem 1 by noticing that is infinitely differentiable around and satisfies
which yields the same decay as the NTK. Geifman et al. (2020) also consider a heuristic generalization of the Laplace kernel with different exponents, . Theorem 1 allows us to obtain a precise decay for this kernel as well using , which is of the form for noninteger , and in particular approaches the limiting order of smoothness when .
Deep kernels with step activations.
We saw in Section 3.2 that for ReLU activations, depth does not change the decay of the corresponding kernels. In contrast, when considering step activations , we show in Appendix C.3 that approximation properties of the corresponding random neuron kernels (of the form ) improve with depth, leading to a decay with for layers. This also leads to an RKHS which becomes as large as allowed (order of smoothness close to ) when . While this may suggest a benefit of depth, note that step activations make optimization hard for anything beyond a linear regime with random weights, since the gradients with respect to inner neurons vanish.
Infinitely differentiable kernels.
Finally, we note that Theorem 1 shows that kernels associated to infinitely differentiable activations (which are themselves infinitely differentiable, see Daniely et al. (2016)), as well as Gaussian kernels on the sphere of the form , have decays faster than any polynomial. This results in a “small” RKHS that only contains smooth functions. See Azevedo and Menegatto (2014) for a more precise study of the decay for Gaussian kernels on the sphere.
4 Numerical experiments
We now present numerical experiments on synthetic and real data to illustrate our theory.
Synthetic experiments.
We consider randomly sampled inputs on the sphere in 4 dimensions, and outputs generated according to the following target models, for an arbitrary : and . Note that is discontinuous and thus not in the RKHS in general, while is in the RKHS of (since it is even and has the same decay as as discussed in Section 3.3). In Figure 1 we compare the quality of approximation for different kernels by examining generalization performance of ridge regression with exact kernels or random features. The regularization parameter is optimized on 10000 test datapoints on a logarithmic grid. In order to illustrate the difficulty of optimization due to a small optimal , which would also indicate slower convergence with gradient methods, we consider grids with , for two different choices of . We see that all kernels provide a similar rate of approximation for a large enough grid, but when fixing a smaller optimization budget by taking a larger , the NTK kernels can achieve better performance for large sample size . Figure 1(right) shows that when using random features (which can achieve optimal rates in some settings, see Rudi and Rosasco, 2017), the “shallow” ReLU network performs better than a threelayer version, despite having fewer weights. This suggests that in addition to providing no improvements to approximation in the infinitewidth case, the kernel regimes for deep ReLU networks may even be worse than their twolayer counterparts in the finitewidth setting.
MNIST and FashionMNIST.
In Table 1, we consider the image classification datasets MNIST and FashionMNIST, which both consist of 60k training and 10k test images of size 28x28 with 10 output classes. We evaluate oneversusall classifiers obtained by using kernel ridge regression by setting for the correct label and otherwise. We train on random subsets of 50k examples and use the remaining 10k examples for validation. We find that test accuracy is comparable for different numbers of layers in RF or NTK kernels, with a slightly poorer performance for the twolayer case likely due to parity constraints, in agreement with our theoretical result that the decay is the same for different . There is a small decrease in accuracy for growing , which may reflect changes in the decay constants or numerical errors when composing kernels. The slightly better performance of RF compared to NTK may suggest that these problems are relatively easy (e.g., the regression function is smooth), so that a faster decay is preferable due to better adaptivity to smoothness.
MNIST FMNIST
L  RF  NTK 

2  98.60 0.03  98.49 0.02 
3  98.67 0.03  98.53 0.02 
4  98.66 0.02  98.49 0.01 
5  98.65 0.04  98.46 0.02 
L  RF  NTK 

2  90.75 0.11  90.65 0.07 
3  90.87 0.16  90.62 0.08 
4  90.89 0.13  90.55 0.07 
5  90.88 0.08  90.50 0.05 
5 Discussion
In this paper, we have analyzed the approximation properties of deep networks in kernel regimes, by studying eigenvalue decays of integral operators through differentiability properties of the kernel function. In particular, the decay is governed by the form of the function’s (generalized) power series expansion around , which remains the same for kernels arising from fullyconnected ReLU networks of varying depths. This result suggests that the kernel approach is unsatisfactory for understanding the power of depth in fullyconnected networks. In particular, it highlights the need to incorporate other regimes in the study of deep networks, such as the mean field regime (Chizat and Bach, 2018; Mei et al., 2018), and other settings with hierarchical structure (see, e.g., AllenZhu and Li, 2020; Chen et al., 2020). We note that our results do not rule out benefits of depth for other network architectures in kernel regimes; for instance, depth may improve stability properties of convolutional kernels (Bietti and Mairal, 2019a, b), and a precise study of approximation for such kernels and its dependence on depth would also be of interest.
Acknowledgments
This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19P3IA0001 (PRAIRIE 3IA Institute). We also acknowledge support of the European Research Council (grant SEQUOIA 724063).
Appendix A Background on Spherical Harmonics
In this section, we provide some background on spherical harmonics needed for our study of approximation. See (Efthimiou and Frye, 2014; Atkinson and Han, 2012; Ismail, 2005) for references, as well as (Bach, 2017a, Appendix D). We consider inputs on the sphere .
We recall some properties of the spherical harmonics introduced in Section 2.2. For , where , the spherical harmonics are homogeneous harmonic polynomials of degree that are orthonormal with respect to the uniform distribution on the sphere. The degree plays the role of an integer frequency, as in Fourier series, and the collection forms an orthonormal basis of . As with Fourier series, there are tight connections between decay of coefficients in this basis w.r.t. , and regularity/differentiability of functions, in this case differentiability on the sphere. This follows from the fact that spherical harmonics are eigenfunctions of the LaplaceBeltrami operator on the sphere (see Efthimiou and Frye, 2014, Proposition 4.5):
(14) 
For a given frequency , we have the following addition formula:
(15) 
where is the th Legendre polynomial in dimension (also known as Gegenbauer polynomial when using a different scaling), given by the Rodrigues formula:
Note that these may also be expressed using the hypergeometric function (see, e.g., Ismail, 2005, Section 4.5), an expression we will use in proof of Theorem 1 (see the proof of Lemma 6).
The polynomials are orthogonal in where the measure is given by the weight function , and we have
(16) 
where denotes the surface of the sphere in dimensions. Using the addition formula (15) and orthogonality of spherical harmonics, we can show
(17) 
We will use two other properties of Legendre polynomials, namely the following recurrence relation (Efthimiou and Frye, 2014, Eq. 4.36)
(18) 
for , and for we simply have , as well as the differential equation (see, e.g., Efthimiou and Frye, 2014, Proposition 4.20):
(19) 
The FunkHecke formula is helpful for computing Fourier coefficients in the basis of spherical harmonics in terms of Legendre polynomials: for any , we have
(20) 
For example, we may use this to obtain decompositions of dotproduct kernels by computing Fourier coefficients of functions . Indeed, denoting
writing the decomposition of using (20) leads to the following Mercer decomposition of the kernel:
(21) 
Appendix B Proof of Theorem 1
The proof of Theorem 1, stated below in full as Theorem 7, proceeds as follows. We first derive an upper bound on the decay of of the form (Lemma 5), which is weaker than the desired , by exploiting regularity properties of through integration by parts. The goal is then to apply this result on a function , where is a function that allows us to “cancel” the leading terms in the expansions of , while being simple enough that it allows a precise estimate of its decay. In the proof of Theorem 7, we follow this strategy by considering as a sum of functions of the form and , for which we provide a precise computation of the decay in Lemma 6.
Decay upper bound through regularity.
We begin by establishing a weak upper bound on the decay of (Lemma 5) by leveraging its regularity up to the terms of order . This is achieved by iteratively applying the following integration by parts lemma, which is conceptually similar to integrating by parts on the sphere by leveraging the spherical Laplacian relation (14) in Appendix A, but directly uses properties of and of Legendre polynomials instead (namely, the differential equation (19)). We note that the final statement in Theorem 1 on infinitely differentiable directly follows from Lemma 5.
Lemma 4 (Integration by parts lemma).
Let be a function that is on and such that . We have
(22)  
(23) 
with .
Proof.
In order to perform integration by parts, we use the following differential equation satisfied by Legendre polynomials (see, e.g., Efthimiou and Frye, 2014, Proposition 4.20):
(24) 
Using this equation, we may write for ,
(25)  
(26) 
We may integrate the second term by parts using
(27) 
Noting that the first term in (25) cancels out with the integral resulting from the second term in (27), we then obtain
Integrating by parts once more, the second term becomes
(28) 
The desired result follows. ∎
Lemma 5 (Weak upper bound on the decay).
Let be a function that is on and has the following expansions around :
(29)  
(30) 
where are polynomials and may be noninteger. Then we have
(31) 
Proof.
Let and for
(32) 
Then is on and has similar expansions to of the form
(33)  
(34) 
for some polynomials . We may apply Lemma 4 repeatedly as long as the terms in brackets vanish, until we obtain, for ,
(35) 
Given our choice for , we have , and for some . Since for any , we obtain . ∎
Precise decay for simple function.
We now provide precise decay estimates for functions of the form and , which will lead to the dominant terms in the decomposition of in the main theorem.
Lemma 6 (Decay for simple functions and ).
Let , with noninteger, and let denote its Legendre coefficients in dimensions given by . We have

if is odd

for even, , with a constant.
Analogously, let . We have

if is even

for odd, , with a constant.
Proof.
We recall the following representation of Legendre polynomials based on the hypergeometric function (e.g., Ismail, 2005, Section 4.5):
(36) 
where the hypergeometric function is given in its generalized form by
(37) 
where is the rising factorial or Pochhammer symbol.
We then have