On Kernel Derivative Approximation with Random Fourier Features
Abstract
Random Fourier features (RFF) represent one of the most popular and widespread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically about their optimality and the limitations of their performance. To the best of our knowledge, the only existing areas where precise statistical-computational tradeoffs have been established are the approximation of kernel values, kernel ridge regression, and kernel principal component analysis. Our goal is to spark the investigation of the optimality of RFF-based approximations in tasks involving not only function values but also derivatives, which naturally lead to optimization problems with kernel derivatives. In particular, in this paper we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is achievable for kernel derivatives using RFFs as for kernel values.
1 Introduction
Kernel techniques [3, 30, 17] are among the most influential and widely applied tools in machine learning and statistics, with significant impact on virtually all areas of both fields. Their versatility stems from the function class associated with a kernel, called the reproducing kernel Hilbert space (RKHS) [2], which shows tremendous success in modelling complex relations.
The key property that makes kernel methods computationally feasible and the optimization over an RKHS tractable is the representer theorem [11, 25, 39]. Particularly, given samples $\{(x_i, y_i)\}_{i=1}^n$, consider the regularized empirical risk minimization problem specified by a kernel $k$, the associated RKHS $\mathcal{H}_k$, a loss function $V$, and a penalty parameter $\lambda > 0$:
(1) $\min_{f \in \mathcal{H}_k}\ \frac{1}{n} \sum_{i=1}^n V\big(y_i, f(x_i)\big) + \lambda\, \|f\|_{\mathcal{H}_k}^2,$
where $\mathcal{H}_k$ is the Hilbert space defined by the following two properties:

(i) $k(\cdot, x) \in \mathcal{H}_k$ for all $x$, where $k(\cdot, x)$ denotes the function obtained by keeping the second argument of $k$ fixed at $x$; and

(ii) $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$ and all $x$, which is called the reproducing property.
Examples falling under (1) include kernel ridge regression with the squared loss $V(y, f(x)) = (y - f(x))^2$, and soft classification with the hinge loss $V(y, f(x)) = \max(0, 1 - y f(x))$.
(1) is an optimization problem over a function class ($\mathcal{H}_k$), which could in general be intractable. Thanks to the specific structure of the RKHS, however, the representer theorem enables one to parameterize the optimal solution of (1) by finitely many coefficients:
(2) $f^{\ast} = \sum_{i=1}^n a_i\, k(\cdot, x_i), \qquad a_1, \ldots, a_n \in \mathbb{R}.$
As a result, (1) becomes a finite-dimensional optimization problem determined by the pairwise similarities of the samples, i.e., by the Gram matrix $G = [k(x_i, x_j)]_{i,j=1}^n$:
(3) $\min_{a \in \mathbb{R}^n}\ \frac{1}{n} \sum_{i=1}^n V\Big(y_i, \sum_{j=1}^n a_j\, k(x_i, x_j)\Big) + \lambda\, a^\top G a,$
where the second term follows from the reproducing property of kernels.
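To make the reduction from (1) to (3) concrete, the following is a minimal sketch of kernel ridge regression with a Gaussian kernel; the function names (`krr_fit`, `krr_predict`) and all parameter values are illustrative choices of ours, not prescriptions from the paper.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    # G[i, j] = k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam=1e-3, sigma=1.0):
    # Representer theorem: with the squared loss, the minimizer of (1) is
    # f* = sum_i a_i k(., x_i), and the coefficients solve (G + lam*n*I) a = y.
    n = len(X)
    G = gaussian_gram(X, X, sigma)
    return np.linalg.solve(G + lam * n * np.eye(n), y)

def krr_predict(a, X_train, X_test, sigma=1.0):
    # Evaluate f*(x) = sum_i a_i k(x, x_i), i.e., the dual form (2)-(3)
    return gaussian_gram(X_test, X_train, sigma) @ a

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
a = krr_fit(X, y)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = krr_predict(a, X, X_test)
```

Note that the solve costs $O(n^3)$ time and $O(n^2)$ memory, which is exactly the bottleneck that motivates the approximations discussed below.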
However, in many learning problems, such as nonlinear variable selection [20, 21], (multi-task) gradient learning [38], semi-supervised or Hermite learning with gradient information [41, 26], or density estimation with infinite-dimensional exponential families [28], taking derivative information into account, beyond function values alone, turns out to be beneficial. In these tasks containing derivatives, (1) is generalized to the form
(4) $\min_{f \in \mathcal{H}_k}\ \frac{1}{n} \sum_{i=1}^n V\big(y_i, \{\partial^p f(x_i)\}_{p \in P}\big) + \lambda\, \|f\|_{\mathcal{H}_k}^2,$
where $P \subset \mathbb{N}^d$ is a finite set of multi-indices. The solution of this minimization task, similarly to (1), enjoys a finite-dimensional parameterization [41]:
$f^{\ast} = \sum_{i=1}^n \sum_{p \in P} a_{i,p}\, \partial^{0,p} k(\cdot, x_i),$
where $a_{i,p} \in \mathbb{R}$. Hence, the optimization in (4) can be reduced to
(5) $\min_{a \in \mathbb{R}^{n|P|}}\ \frac{1}{n} \sum_{i=1}^n V\Big(y_i, \Big\{\sum_{j=1}^n \sum_{q \in P} a_{j,q}\, \partial^{p,q} k(x_i, x_j)\Big\}_{p \in P}\Big) + \lambda \sum_{i,j=1}^n \sum_{p,q \in P} a_{i,p}\, a_{j,q}\, \partial^{p,q} k(x_i, x_j),$
where $|P|$ denotes the cardinality of $P$, and we used the derivative-reproducing property of kernels $\partial^p f(x) = \langle f, \partial^{0,p} k(\cdot, x) \rangle_{\mathcal{H}_k}$.
Compared to (3), where the kernel values $k(x_i, x_j)$ determine the objective, (5) is determined by the kernel derivatives $\partial^{p,q} k(x_i, x_j)$.
While kernel techniques are extremely powerful due to their modelling capabilities, this flexibility comes at a price: they are often computationally expensive. To mitigate this computational bottleneck, several approaches have been proposed in the literature, such as the Nyström and subsampling methods [36, 9, 22], sketching [1, 37], and random Fourier features (RFF) [18, 19] with their approximate memory-reduced variants and structured extensions [12, 7, 4].
The focus of the current submission is probably the conceptually simplest and most influential approximation scheme among these approaches, RFF. (As a recognition of its influence, the work [18] won the 10-year test-of-time award at NIPS-2017.) The RFF technique implements a rather elementary yet powerful approach: it constructs a random, low-dimensional, explicit Fourier feature map $\varphi$ for a continuous, bounded, shift-invariant kernel, relying on Bochner's theorem:
$k(x, y) = \int_{\mathbb{R}^d} e^{i \langle w, x - y \rangle}\, \mathrm{d}\Lambda(w) \approx \langle \varphi(x), \varphi(y) \rangle.$
The advantage of such a feature map becomes apparent after applying the parametrization $f_\theta(x) = \langle \varphi(x), \theta \rangle$:
(6) $\min_{\theta}\ \frac{1}{n} \sum_{i=1}^n V\big(y_i, \langle \varphi(x_i), \theta \rangle\big) + \lambda\, \|\theta\|_2^2.$
This parameterization can be considered as an approximate version of the reproducing property $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$: $k(\cdot, x)$ is changed to $\varphi(x)$ and $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ to the Euclidean inner product. (6) allows one to leverage fast linear solvers for kernel machines in the primal [(1) or (4)]. This idea has been applied in a wide range of areas such as causal discovery [15], fast function-to-function regression [16], independence testing [40], convolutional neural networks [6], prediction and filtering in dynamical systems [8], and bandit optimization [13].
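As a quick numerical illustration of the feature map behind this idea, the sketch below assumes a Gaussian kernel (whose spectral measure is a Gaussian distribution); all variable names and constants are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 2, 2000, 1.0
# Spectral measure of k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) is N(0, sigma^{-2} I_d)
W = rng.standard_normal((m, d)) / sigma

def phi(x):
    # phi(x) = m^{-1/2} (cos<w_j, x>, sin<w_j, x>)_{j=1}^m, so that
    # <phi(x), phi(y)> = (1/m) sum_j cos<w_j, x - y>  ->  k(x, y) as m grows
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

x, y = rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
approx = phi(x) @ phi(y)
print(exact, approx)  # the two values agree up to a Monte Carlo error of order m^{-1/2}
```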
Despite the tremendous practical success of RFFs, their theoretical understanding is quite limited, with only a few optimal guarantees available [29, 23, 27, 14, 32].

Concerning the approximation quality of kernel values, the uniform finite-sample bounds of [18, 31] show that
$\sup_{x, y \in S} \big|k(x, y) - \langle \varphi(x), \varphi(y) \rangle\big| = O_p\Big(|S| \sqrt{\log(m) / m}\Big),$
where $S$ is a compact set, $|S|$ is its diameter, $m$ is the number of RFFs, and $O_p$ denotes boundedness in probability. [29] recently proved a finite-sample bound that is exponentially tighter in terms of $|S|$, implying
(7) $\sup_{x, y \in S} \big|k(x, y) - \langle \varphi(x), \varphi(y) \rangle\big| = O_{a.s.}\Big(\sqrt{\log(|S|) / m}\Big),$
where $O_{a.s.}$ denotes boundedness almost surely. This bound is optimal w.r.t. $m$ and $|S|$, as is known from the characteristic function literature [5].

In terms of generalization, [19] showed that $O(1/\sqrt{n})$ generalization error can be attained using $m = O(n)$ RFFs, where $n$ denotes the number of training samples. This bound is somewhat pessimistic, leaving the computational usefulness of RFFs open. Recently, [23] proved that $O(1/\sqrt{n})$ generalization performance is attainable in the context of kernel ridge regression with $m = O(\sqrt{n} \log n)$ RFFs. This result settles the case of RFFs in the simplest least-squares setting with Tikhonov regularization. The result has since been sharpened [14] to $m = O(d_{\mathrm{eff}} \log n)$ with no loss in excess risk, where the effective degrees of freedom $d_{\mathrm{eff}}$ can often be significantly smaller than the number of samples.
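The computational point behind these results can be illustrated by solving ridge regression directly in the primal on roughly $\sqrt{n} \log n$ random features. This is only a sketch in the spirit of the kernel ridge regression results cited above, with illustrative constants of our own choosing, not the cited papers' exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, lam = 500, 1, 1.0, 1e-3
X = rng.uniform(-3, 3, (n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

m = int(np.sqrt(n) * np.log(n))  # ~ sqrt(n) log n features
W = rng.standard_normal((m, d)) / sigma  # Gaussian-kernel spectral measure

def features(X):
    P = X @ W.T
    return np.concatenate([np.cos(P), np.sin(P)], axis=1) / np.sqrt(m)

Z = features(X)  # n x 2m design matrix
# Primal ridge regression: costs O(n m^2) instead of the O(n^3) dual solve
theta = np.linalg.solve(Z.T @ Z + lam * n * np.eye(2 * m), Z.T @ y)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = features(X_test) @ theta
```

Since $2m \ll n$ here, both the linear system and the memory footprint shrink substantially relative to the exact kernel solution.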

[27] investigated the computational-statistical tradeoffs of RFFs in kernel principal component analysis (KPCA). Their results show that, depending on the eigenvalue decay behavior of the covariance operator associated with the kernel (polynomial or exponential decay), correspondingly fewer RFFs than samples are sufficient to match the statistical performance of KPCA, where again $n$ denotes the number of samples. [32] proved a similar result showing that $\tilde{O}(\sqrt{n})$ RFFs are sufficient for optimal statistical performance provided that the spectrum of the covariance operator decays exponentially, and presented a streaming algorithm for KPCA relying on the classical Oja updates that achieves the same statistical performance.
In contrast to the previous results, the focus of our paper is the investigation of problems involving kernel derivatives [see (4) and (5)]. The idea applied in practice is to formally differentiate (6), giving
(8) $\partial^p f_\theta(x) = \langle \partial^p \varphi(x), \theta \rangle,$
which is then used in the primal [(4)] and optimized over $\theta$. From the dual point of view [(5)], this means that the kernel derivatives are implicitly approximated via RFFs as $\partial^{p,q} k(x, y) \approx \langle \partial^p \varphi(x), \partial^q \varphi(y) \rangle$. The question we raise in this paper is how accurate these kernel derivative approximations are.
Our contribution is to show that the same dependency in terms of $m$ and $|S|$ can be attained for kernel derivatives as for kernel values, depicted in (7). To the best of our knowledge, the tightest available guarantee on kernel derivatives [29] is
$\sup_{x, y \in S} \big|\partial^{p,q} k(x, y) - \langle \partial^p \varphi(x), \partial^q \varphi(y) \rangle\big| = O_{a.s.}\Big(|S| \sqrt{\log(m) / m}\Big).$
In this paper, we prove finite-sample bounds on the approximation quality of kernel derivatives, which specifically imply that
(9) $\sup_{x, y \in S} \big|\partial^{p,q} k(x, y) - \langle \partial^p \varphi(x), \partial^q \varphi(y) \rangle\big| = O_{a.s.}\Big(\sqrt{\log(|S|) / m}\Big).$
The possibility of such an exponentially improved dependence in terms of $|S|$ is rather surprising since, in the case of kernel derivatives, the underlying function classes are no longer uniformly bounded. We circumvent this challenge by applying recent tools from the theory of unbounded empirical processes.
2 Problem Formulation
In this section we formulate our problem after introducing a few notations.
Notations: $\mathbb{N}$, $\mathbb{Z}^{+}$ and $\mathbb{R}$ denote the sets of natural numbers, positive integers and real numbers, respectively. For $n \in \mathbb{N}$, $n!$ denotes its factorial. $\Gamma$ is the Gamma function ($\Gamma(t) = \int_0^{\infty} u^{t-1} e^{-u}\, \mathrm{d}u$, $t > 0$); $\Gamma(n + 1) = n!$ ($n \in \mathbb{N}$). Let $n!!$ denote the double factorial of $n \in \mathbb{Z}^{+}$, that is, the product of all numbers from $n$ to $1$ that have the same parity as $n$; specifically, $0!! = 1$. If $n$ is a positive odd integer, then $\Gamma\left(\frac{n}{2} + 1\right) = \sqrt{\pi}\, \frac{n!!}{2^{(n+1)/2}}$. For $f: \mathbb{R} \to \mathbb{R}$, $f^{(n)}$ is the $n$-th derivative of the function $f$. For multi-indices $p, q \in \mathbb{N}^d$, $|p| = \sum_{j=1}^d p_j$, and we use $\partial^p f(x) = \frac{\partial^{|p|} f(x)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d}}$, $\partial^{p,q} k(x, y) = \frac{\partial^{|p| + |q|} k(x, y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d}\, \partial y_1^{q_1} \cdots \partial y_d^{q_d}}$ to denote partial derivatives. $\langle x, y \rangle = \sum_{j=1}^d x_j y_j$ is the inner product between $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$. $x^\top$ is the transpose of $x$, $\|x\|_2 = \sqrt{\langle x, x \rangle}$ is its Euclidean norm, $(x; y) \in \mathbb{R}^{2d}$ is the concatenation of the vectors $x, y \in \mathbb{R}^d$. Let $S \subseteq \mathbb{R}^d$ be a Borel set. $\mathcal{M}_1^{+}(S)$ is the set of Borel probability measures on $S$. $\Lambda^m$ is the $m$-fold product measure $\Lambda \times \cdots \times \Lambda$, where $\Lambda \in \mathcal{M}_1^{+}(S)$. $L^r(S, \mu)$ is the Banach space of real-valued, $r$-th power $\mu$-integrable functions on $S$ ($r \geq 1$), with norm $\|f\|_{L^r(\mu)} = \left(\int_S |f|^r\, \mathrm{d}\mu\right)^{1/r}$; specifically, for the empirical measure $\Lambda_m = \frac{1}{m} \sum_{j=1}^m \delta_{w_j}$, $\|f\|_{L^r(\Lambda_m)} = \left(\frac{1}{m} \sum_{j=1}^m |f(w_j)|^r\right)^{1/r}$, where $\delta_w$ is the Dirac measure supported on $w$. For positive sequences $(a_n)$, $(b_n)$, $a_n = O(b_n)$ (resp. $a_n = o(b_n)$) means that $a_n / b_n$ is bounded (resp. $a_n / b_n \to 0$). Positive sequences $(a_n)$, $(b_n)$ are said to be asymptotically equivalent, shortly $a_n \sim b_n$, if $a_n / b_n \to 1$. $O_p$ (resp. $O_{a.s.}$) denotes boundedness in probability (resp. almost surely). The diameter of a compact set $S \subseteq \mathbb{R}^d$ is defined as $|S| = \sup_{x, y \in S} \|x - y\|_2$. The natural logarithm is denoted by $\log$.
We continue with the formulation of our task. Let $k$ be a continuous, bounded, shift-invariant kernel. By the Bochner theorem [24], it is the Fourier transform of a finite, nonnegative Borel measure $\Lambda$:
(10) $k(x, y) = \int_{\mathbb{R}^d} e^{i \langle w, x - y \rangle}\, \mathrm{d}\Lambda(w) \overset{(a)}{=} \int_{\mathbb{R}^d} \cos(\langle w, x - y \rangle)\, \mathrm{d}\Lambda(w)$
(11) $\overset{(b)}{=} \int_{\mathbb{R}^d} \big[\cos(\langle w, x \rangle) \cos(\langle w, y \rangle) + \sin(\langle w, x \rangle) \sin(\langle w, y \rangle)\big]\, \mathrm{d}\Lambda(w),$
where
(a) follows from the real-valued property of $k$, and (b) is a consequence of the trigonometric identity $\cos(\alpha - \beta) = \cos(\alpha)\cos(\beta) + \sin(\alpha)\sin(\beta)$. Without loss of generality, it can be assumed that $\Lambda \in \mathcal{M}_1^{+}(\mathbb{R}^d)$ since $\Lambda(\mathbb{R}^d) = k(x, x) < \infty$ and the normalization $k / k(x, x)$ yields a probability measure.
Let $p, q \in \mathbb{N}^d$. By differentiating (11) (the differentiation is valid, by the dominated convergence theorem, if $\int_{\mathbb{R}^d} \|w\|_2^{|p| + |q|}\, \mathrm{d}\Lambda(w) < \infty$) one gets
(12) $\partial^{p,q} k(x, y) = \int_{\mathbb{R}^d} \partial^{p,q} \big[\cos(\langle w, x \rangle) \cos(\langle w, y \rangle) + \sin(\langle w, x \rangle) \sin(\langle w, y \rangle)\big]\, \mathrm{d}\Lambda(w).$
The resulting expectation can be approximated by the Monte Carlo technique using $w_1, \ldots, w_m \overset{\text{i.i.d.}}{\sim} \Lambda$ as
(13) $\widehat{\partial^{p,q} k}(x, y) = \big\langle \partial^p \varphi(x), \partial^q \varphi(y) \big\rangle,$
where
(14) $\varphi(x) = \frac{1}{\sqrt{m}} \big(\cos(\langle w_1, x \rangle), \sin(\langle w_1, x \rangle), \ldots, \cos(\langle w_m, x \rangle), \sin(\langle w_m, x \rangle)\big)^\top.$
Specifically, if $p = q = 0$ then (13) boils down to the celebrated RFF technique [18]: $\hat{k}(x, y) = \langle \varphi(x), \varphi(y) \rangle$.
Our goal is to prove that, similarly to kernel values [(7)], fast approximation of kernel derivatives [(9)] is attainable. Alternatively put, we establish that the derivative of the RFF feature map $\varphi$ [see (13)-(14)] is as efficient for kernel derivative approximation as $\varphi$ itself is for kernel value approximation.
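For intuition, the following sketch compares the Monte Carlo estimator (13) with the analytic kernel derivative for the one-dimensional Gaussian kernel, taking $p = 1$, $q = 0$; the parameterization and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m = 1.0, 5000
w = rng.standard_normal(m) / sigma  # spectral measure N(0, sigma^{-2}) in 1D

def k(x, y):  # Gaussian kernel
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

def dk_dx(x, y):  # analytic partial derivative in the first argument
    return -(x - y) / sigma**2 * k(x, y)

def phi(x):  # RFF map as in (14): m^{-1/2} (cos(w_j x), sin(w_j x))_j
    return np.concatenate([np.cos(w * x), np.sin(w * x)]) / np.sqrt(m)

def dphi(x):  # coordinate-wise derivative of phi, used by the estimator (13)
    return np.concatenate([-w * np.sin(w * x), w * np.cos(w * x)]) / np.sqrt(m)

x, y = 0.3, -0.7
print(dk_dx(x, y), dphi(x) @ phi(y))  # close for large m
```

Note that the summands now carry a factor of $w_j$, so they are unbounded over the draw of the frequencies; this is precisely the feature that breaks the classical bounded empirical process arguments discussed in Section 3.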
3 Main Result
In this section we present our main result on the uniform approximation quality of kernel derivatives using RFFs. Its proof is available in Section 4.
Theorem (Uniform guarantee on kernel derivative approximation).
Suppose that is a continuous, bounded and shiftinvariant kernel. For , assume and for some constant , the following Bernstein condition holds:
(15) 
where . Let , , and . Then for any and compact set
with probability at most .
Remarks.

Growth of $|S|$: The theorem proves the same dependence [(9)] on $m$ and $|S|$ as is known for kernel values [(7)]. The result implies that
if .

$L^r$-based guarantee: From the theorem above one can also get (see Section 4) the following guarantee, where .
Under the same conditions and notations as in the theorem, for any
This shows that
Consequently, if as then is a consistent estimator of in norm provided that .

Bernstein condition for the Gaussian kernel: Next we illustrate how the Bernstein condition [(15)] translates to the efficient estimation of 'not too large'-order kernel derivatives in the case of the Gaussian kernel. For simplicity, let us consider the Gaussian kernel in one dimension ($d = 1$, $k(x, y) = e^{-(x - y)^2 / (2\sigma^2)}$); in this case $\Lambda$ is a normal distribution with mean zero and variance $1/\sigma^2$. Let and denote the l.h.s. of (15) as
By the analytical formula for the absolute moments of normal random variables ($X \sim N(0, s^2)$):
(16) $\mathbb{E}|X|^n = s^n\, \frac{2^{n/2}\, \Gamma\left(\frac{n + 1}{2}\right)}{\sqrt{\pi}}, \qquad n \in \mathbb{N}.$
Since does not depend on , one can assume that and . Exploiting the analytical expression obtained for one can show (Section 4) that for
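The moment formula (16) can be sanity-checked numerically; this is a standalone check with a helper name of our own.

```python
import math

def abs_moment(n, s=1.0):
    # E|X|^n for X ~ N(0, s^2): s^n * 2^(n/2) * Gamma((n+1)/2) / sqrt(pi)
    return s**n * 2 ** (n / 2) * math.gamma((n + 1) / 2) / math.sqrt(math.pi)

# For even n the formula reduces to s^n (n-1)!!, e.g. E X^2 = s^2, E X^4 = 3 s^4;
# for n = 1 it recovers the half-normal mean s * sqrt(2 / pi).
print(abs_moment(2), abs_moment(4), abs_moment(1))
```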

Difficulty: The fundamental difficulty one has to tackle to arrive at the stated theorem is as follows.
By differentiating (10) one gets
By defining
(17) the error we would like to control can be rewritten as the supremum of the empirical process
where . For $p = q = 0$ (i.e., the classical RFF-based kernel approximation)
which is a uniformly bounded family of functions:
This uniform boundedness is the classical assumption of empirical process theory, and it is what allowed [29] to get the optimal rates. For $|p| + |q| > 0$, however, the functions are unbounded, and the class is no longer uniformly bounded. Therefore, one has to control unbounded empirical processes, for which only a few tools are available.
The key idea of our paper is to apply a recent technique which bounds the supremum by a weighted sum of bracketing entropies of the function class at multiple scales. By estimating these bracketing entropies and optimizing the scale, the result follows. This is what we detail in the next section.
4 Proofs
We provide the proofs of the results (main theorem and its consequence, remark on the Bernstein condition) presented in Section 3. We start by introducing a few additional notations specific to this section.
Notations: The volume of a compact set $S \subseteq \mathbb{R}^d$ is defined as $\mathrm{vol}(S) = \int_S \mathrm{d}x$. $\Gamma(a, z) = \int_z^{\infty} u^{a-1} e^{-u}\, \mathrm{d}u$ is the incomplete Gamma function ($a > 0$, $z \geq 0$) that satisfies $\Gamma(a + 1, z) = a\, \Gamma(a, z) + z^a e^{-z}$ and $\Gamma\left(\frac{1}{2}, z^2\right) = \sqrt{\pi}\, [1 - \mathrm{erf}(z)]$, where $\mathrm{erf}$ is the error function ($\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, \mathrm{d}t$). Let $(\mathcal{X}, \rho)$ be a metric space. The covering number of $\mathcal{X}$ is defined as the size of the smallest $\eta$-net, i.e., $N(\mathcal{X}, \rho, \eta) = \min\{t \in \mathbb{Z}^{+} : \exists\, x_1, \ldots, x_t \in \mathcal{X} \text{ such that } \mathcal{X} \subseteq \cup_{s=1}^t B_\rho(x_s, \eta)\}$, where $B_\rho(x, \eta)$ is the closed ball with center $x$ and radius $\eta$. For a set $\mathcal{G}$ of real-valued functions and a probability measure $\mu$, the cardinality of the minimal $\eta$-bracketing of $\mathcal{G}$ is defined as $N_{[\,]}\big(\mathcal{G}, L^r(\mu), \eta\big) = \min\{t \in \mathbb{Z}^{+} : \exists\, (g_s^L, g_s^U)_{s=1}^t \text{ such that } \|g_s^U - g_s^L\|_{L^r(\mu)} \leq \eta \text{ for all } s, \text{ and } \forall g \in \mathcal{G}\ \exists s : g_s^L \leq g \leq g_s^U\}$.
The proof of the main theorem is structured as follows.

First, we rescale and reformulate the approximation error as the suprema of unbounded empirical processes, for which bounds in terms of bracketing entropies at multiple scales can be obtained.

Then, we bound the bracketing entropies via Lipschitz continuity.

Finally, the scale is optimized.
Step 1. It follows from (17) that,
Define so that
(18) 
where . The target quantity can be rewritten in supremum of empirical process form as
By the Bernstein condition [(15)] the following uniform bound holds:
(19) 
The uniform boundedness of [(18)] with its Bernstein property [(19)] imply by [34, Theorem 8] that for all and for all scale
(20) 
where
and is the cardinality of the minimal generalized bracketing set of . Formally, , and for such that .
Step 2. We continue the proof by bounding the entropies and in (20). Using (15) for the envelope function , we get
Hence also satisfies the weaker Bernstein condition. Consequently, one can choose [34, remark after Definition 8], and .
Next we bound (). The function class is Lipschitz continuous in the parameters ():