On Kernel Derivative Approximation with Random Fourier Features

On Kernel Derivative Approximation with Random Fourier Features

Zoltán Szabó CMAP, École Polytechnique
Route de Saclay, 91128 Palaiseau, France
zoltan.szabo@polytechnique.edu
Bharath K. Sriperumbudur Department of Statistics
Pennsylvania State University
314 Thomas Building
University Park, PA 16802
bks18@psu.edu
Abstract

Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. To the best of our knowledge, the only existing areas where precise statistical-computational trade-offs have been established are approximation of kernel values, kernel ridge regression, and kernel principal component analysis. Our goal is to spark the investigation of optimality of RFF-based approximations in tasks involving not only function values but derivatives, which naturally lead to optimization problems with kernel derivatives. Particularly, in this paper, we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is achievable for kernel derivatives using RFF as for kernel values.

1 Introduction

Kernel techniques [3, 30, 17] are among the most influential and widely-applied tools, with significant impact on virtually all areas of machine learning and statistics. Their versatility stems from the function class associated to a kernel called reproducing kernel Hilbert space (RKHS) [2] which shows tremendous success in modelling complex relations.

The key property that makes kernel methods computationally feasible and the optimization over RKHS tractable is the representer theorem [11, 25, 39]. Particularly, given samples , consider the regularized empirical risk minimization problem specified by a kernel , the associated RKHS , a loss function , and a penalty parameter :

 minf∈HkJ0(f):=1nn∑i=1V(yi,f(xi))+λ∥f∥2Hk, (1)

where is the Hilbert space defined by the following two properties:

1. (),111 denotes the function while keeping fixed. and

2. , which is called the reproducing property.

Examples falling under (1) include e.g., kernel ridge regression with the squared loss or soft-classification with the hinge loss:

 V(f(xi),yi) =(f(xi)−yi)2, V(f(xi),yi) =max(1−yif(xi),0).

(1) is an optimization problem over a function class () which could generally be intractable. Thanks to the specific structure of RKHS, however, the representer theorem enables one to parameterize the optimal solution of (1) by finitely many coefficients:

 (2)

As a result, (1) becomes a finite-dimensional optimization problem determined by the pairwise similarities of the samples []:

 minc∈Rn~J0(c) :=1nn∑i=1V(yi,n∑j=1cjk(xi,xj))+λn∑i=1n∑j=1cicjk(xi,xj), (3)

where the second term follows from the reproducing property of kernels.

However, in many learning problems such as nonlinear variable selection [20, 21], (multi-task) gradient learning [38], semi-supervised or Hermite learning with gradient information [41, 26], or density estimation with infinite-dimensional exponential families [28], apparently considering the derivative information (, ) other than just the function values () turns out to be beneficial. In these tasks containing derivatives, (1) is generalized to the form

 minf∈HkJ(f):=1nn∑i=1V(yi,{∂pf(xi)}p∈Ii)+λ∥f∥2Hk. (4)

The solution of this minimization task —similar to (1)—enjoys a finite-dimensional parameterization [41]:

 f(⋅) =n∑j=1∑p∈Ijcj,p∂p,0k(⋅,xj),(cj,p∈R),

where . Hence, the optimization in (4) can be reduced to

 minc~J(c) =1nn∑i=1V⎛⎝yi,{n∑j=1∑p∈Ijcj,p∂p,0k(xi,xj)}p∈Ii⎞⎠+λn∑i=1∑p∈Iin∑j=1∑q∈Ijci,pcj,p∂p,qk(xi,xj), (5)

where , denotes the cardinality of , and we used the derivative-reproducing property of kernels

 ∂pf(x)=⟨f,∂p,0k(⋅,x)⟩Hk.

Compared to (3) where the kernel values determine the objective, (5) is determined by the kernel derivatives .

While kernel techniques are extremely powerful due to their modelling capabilities, this flexibility comes with a price, often they are computationally expensive. In order to mitigate this computational bottleneck, several approaches have been proposed in the literature such as the Nyström and sub-sampling methods [36, 9, 22], sketching [1, 37], or random Fourier features (RFF) [18, 19] and their approximate memory-reduced variants and structured extensions [12, 7, 4].

The focus of the current submission is probably the conceptually simplest and most influential approximation scheme among these approaches, RFF.222As a recognition of its influence, the work [18] won the 10-year test-of-time award at NIPS-2017. The RFF technique implements a rather elementary yet powerful approach: it constructs a random, low-dimensional, explicit Fourier feature map () for a continuous, bounded, shift-invariant kernel relying on the Bochner’s theorem:

 ^k(x,y) =⟨φ(x),φ(y)⟩, φ :Rd→Rm.

The advantage of such a feature map becomes apparent after applying the parametrization:

 ^f(x)=⟨w,φ(x)⟩,w∈Rm. (6)

This parameterization can be considered as an approximate version of the reproducing property

 f(x)=⟨f,k(⋅,x)⟩Hk,

is changed to and to . (6) allows one to leverage fast solvers for kernel machines in the primal [(1) or (4)]. This idea has been applied in a wide range of areas such as causal discovery [15], fast function-to-function regression [16], independence testing [40], convolution neural networks [6], prediction and filtering in dynamical systems [8], or bandit optimization [13].

Despite the tremendous practical success of RFF-s, its theoretical understanding is quite limited, with only a few optimal guarantees [29, 23, 27, 14, 32].

• Concerning the approximation quality of kernel values, the uniform finite-sample bounds of [18, 31] show that

 ∥∥k−ˆk∥∥L∞(S×S) :=supx,y∈S∣∣k(x,y)−ˆk(x,y)∣∣=Op(|S|√m−1logm),

where is a compact set, is its diameter, is the number of RFFs, means convergence in probability. [29] recently proved an exponentially tighter finite-sample bound in terms of implying

 ∥∥k−ˆk∥∥L∞(S×S)=Oa.s.(√log|S|/√m), (7)

where denotes almost sure convergence. This bound is optimal w.r.t.  and , as it is known from the characteristic function literature [5].

• In terms of generalization, [19] showed that generalization error can be attained using RFFs, where denotes the number of training samples. This bound is somewhat pessimistic, leaving the usefulness of RFFs open. Recently [23] proved that generalization performance is attainable in the context of kernel ridge regression, with RFFs. This result settles RFFs in the simplest least-squares setting with Tikhonov regularization. Recently, the result has been sharpened [14] to with no loss in excess risk, where the effective degrees of freedom can often be significantly smaller than the number of samples.

• [27] has investigated the computational-statistical trade-offs of RFFs in kernel principal component analysis (KPCA). Their result show that depending on the eigenvalue decay behavior of the covariance operator associated to the kernel, (polynomial decay) or (exponential decay) RFFs are sufficient to match the statistical performance of KPCA, where again denotes the number of samples. [32] proved a similar result showing that number of RFFs is sufficient for optimal statistical performance provided that the spectrum of the covariance operator follows an exponential decay, and presented a streaming algorithm for KPCA relying on the classical Oja’s updates, achieving the same statistical performance.

In contrast to the previous results, the focus of our paper is the investigation of problems involving kernel derivatives [see (4) and (5)]. The idea applied in practice is to formally differentiate (6) giving

 ˆ∂pf(x):=∂p^f(x)=⟨w,∂pφ(x)⟩, (8)

which is then used in the primal [(4)], and optimized for . From the dual point of view [(5)], this means that implicitly the kernel derivatives are approximated via RFFs. The problem we raise in this paper is how accurate these kernel derivative approximations are.

Our contribution is to show that the same dependency in terms of and can be attained for kernel derivatives as for kernel values depicted in (7). To the best of our knowledge, the tightest available guarantee on kernel derivatives [29] is

 ∥∥∂p,qk−ˆ∂p,qk∥∥L∞(S×S)=Oa.s.(|S|√m−1logm).

In this paper, we prove finite sample bounds on the approximation quality of kernel derivatives, which specifically imply that

 ∥∥∂p,qk−ˆ∂p,qk∥∥L∞(S×S) =Oa.s.(√log|S|/√m). (9)

The possibility of such an exponentially improved dependence in terms of is rather surprising, as in case of kernel derivatives the underlying function classes are no longer uniformly bounded. We circumvent this challenge by applying recent tools from unbounded empirical process theory.

Our paper is structured as follows. We formulate our problem in Section 2. The main result on the approximation quality of kernel derivatives is presented in Section 3. Proofs are provided in Section 4.

2 Problem Formulation

In this section we formulate our problem after introducing a few notations.

Notations: , and denotes the set of natural numbers, positive integers and real numbers respectively. For , denotes its factorial. is the Gamma function (); (). Let denote the double factorial of , that is, the product of all numbers from to that have the same parity as ; specifically . If is a positive odd integer, then . For , is the derivative of the function. For multi-indices , , and we use , to denote partial derivatives. is the inner product between and . is the transpose of , is its Euclidean norm, is the concatenation of the vectors. Let be a Borel set. is the set of Borel probability measures on . is the -fold product measure where . is the Banach space of real-valued, -power Lebesgue integrable functions on (). , where and ; specifically for the empirical measure, , where and is the Dirac measure supported on . . For positive sequences , , (resp. ) means that is bounded (resp. ). Positive sequences , are said to be asymptotically equivalent, shortly , if . (resp. ) denotes that is bounded in probability (resp. almost surely). The diameter of a compact set is defined as . The natural logarithm is denoted by .

We continue with the formulation of our task. Let be a continuous, bounded, shift-invariant kernel. By the Bochner theorem [24], it is the Fourier transform of a finite, non-negative Borel measure :

 k(x,y) =~k(x−y)=∫Rde√−1ωT(x−y)dΛ(ω)\lx@stackrel(a)=∫Rdcos(ωT(x−y))dΛ(ω) (10) \lx@stackrel(b)=∫Rd[cos(ωTx)cos(ωTy)+sin(ωTx)sin(ωTy)]dΛ(ω)=∫Rd⟨ϕω(x),ϕω(y)⟩R2dΛ(ω), (11)

where

 ϕω(x) =[cos(ωTx);sin(ωTx)].

(a) follows from the real-valued property of , and (b) is a consequence of the trigonometric identity . Without loss of generality, it can be assumed that since and the normalization yields

 k(x,y)~k(0)=∫Rdcos(ωT(x−y))dΛ(ω)Λ(Rd)=:P(ω)∈M1+(Rd).

Let . By differentiating333By the dominated convergence theorem, the differentiation is valid if . (11) one gets

 ∂p,qk(x,y) =∫Rd⟨∂pϕω(x),∂qϕω(y)⟩R2dΛ(ω). (12)

The resulting expectation can be approximated by the Monte-Carlo technique using as

 ˆ∂p,qk(x,y) =∫Rd⟨∂pϕω(x),∂qϕω(y)⟩R2dΛm(ω)=1mm∑j=1⟨∂pϕωj(x),∂qϕωj(y)⟩R2 =⟨φp(x),φq(y)⟩R2m, (13)

where ,

 φp(x) (14)

Specifically, if then (13) boils down to the celebrated RFF technique [18]:

 ˆk(x,y) =⟨φ0(x),φ0(y)⟩R2m, φ0(x) =1√m(cos(ωTjx);sin(ωTjx))mj=1.

Our goal is to prove that similarly to [(7)], fast approximation of kernel derivatives [(9)] is attainable. Alternatively, we establish that the derivative (see and (13)-(14)) of the RFF feature map () is as efficient for kernel derivative approximation as for kernel value approximation.

3 Main Result

In this section we present our main result on the uniform approximation quality of kernel derivatives using RFFs. Its proof is available in Section 4.

Theorem (Uniform guarantee on kernel derivative approximation).

Suppose that is a continuous, bounded and shift-invariant kernel. For , assume and for some constant , the following Bernstein condition holds:

 ∫Rd|ωp+q|n(σp,q)ndΛ(ω)≤n!2Kn−2,n=2,3,…, (15)

where . Let , , and . Then for any and compact set

 Λm({ω1:m:∥∥∂p,qk−ˆ∂p,qk∥∥L∞(S×S)≥σp,q(C3√dln(16|S|Cp,q+4)√m+C1√m+C2m++24√6√m[√t+Lmt2])})

with probablity at most .

Remarks.

• Growth of : The theorem proves the same dependence [(9)] on and as is known for kernel values (). The result implies that

 ∥∥∂p,qk−ˆ∂p,qk∥∥L∞(Sm×Sm)a.s.−−→0

if .

• Requirements for : In this case ,

• , thus (15) holds ().

• The only requirement is the finiteness of , which is identical to that imposed in [29, Theorem 1] for kernel values.

• -based guarantee: From the theorem above one can also get (see Section 4) the following guarantee, where .

Under the same conditions and notations as in the theorem, for any

 aaa+C1√m+C2m+24√6√m[√t+Lmt2])})≤2e−t.

This shows that

 ∥∥∂p,qk−ˆ∂p,qk∥∥Lr(S×S)=Oa.s.(m−12|S|2dr√log|S|).

Consequently, if as then is a consistent estimator of in -norm provided that .

• Bernstein condition with : Next we illustrate how the Bernstein condition [(15)] translates to the efficient estimation of ‘not too large’-order kernel derivatives in case of the Gaussian kernel. For simplicity let us consider the Gaussian kernel in one dimension (); in this case is a normal distribution with mean zero and variance . Let and denote the l.h.s. of (15) as

 Ar,n(Λ) =∫R|ω|rndΛ(ω)[√∫R|ω|2rdΛ(ω)]n.

By the analytical formula for the absolute moments of normal random variables

 Ar,n(Λ) =σnr(nr−1)!!{1if nr is % even√2πif nr is odd[σ2r(2r−1)!!]n2=(nr−1)!!{1if nr % is even√2πif nr is odd[(2r−1)!!]n2. (16)

Since does not depend on , one can assume that and . Exploiting the analytical expression obtained for one can show (Section 4) that for

• : Since , (15) holds with .

• : is a suitable choice in (15).

• and : Asymptotic argument shows that (15) can not hold.

It is an interesting open question whether one can relax (15) while maintaining similar rates, and what are the trade-offs.

• Difficulty: The fundamental difficulty one has to tackle to arrive at the stated theorem is as follows.

By differentiating (10) one gets

 ∂p,qk(x,y) =∫Rdωp(−ω)qc|p+q|(ωT(x−y))dΛ(ω).

By defining

 gz(ω) =ωp(−ω)qc|p+q|(ωTz), (17)

the error we would like to control can be rewritten as the supremum of the empirical process

where . For (i.e., the classical RFF-based kernel approximation)

 gz(ω) =cos(ωTz)(z∈SΔ)

which is a uniformly bounded family of functions:

 supz∈SΔ∥gz∥L∞(Rd)≤1.

This uniform boundedness is the classical assumption of empirical process theory, and allowed one [29] to get the optimal rates. For , however, the functions are unbounded and so is no longer uniformly bounded in . Therefore, one has to control unbounded empirical processes for which only few tools are available.

The key idea of our paper is to apply a recent technique which bounds the supremum as a weighted sum of bracketing entropies of at multiple scales. By estimating these bracketing entropies and optimizing the scale the result will follow. This is what we detail in the next section.

4 Proofs

We provide the proofs of the results (main theorem and its consequence, remark on the Bernstein condition) presented in Section 3. We start by introducing a few additional notations specific to this section.

Notations: The volume of is defined as . is the incomplete Gamma function (, ) that satisfies and , where is the error function (). Let be a metric space. The -covering number of is defined as the size of the smallest -net, i.e., , where is the closed ball with center and radius . For a set of real-valued functions and , the cardinality of the minimal -bracketing of is defined as such that and .

The proof of the main theorem is structured as follows.

1. First, we rescale and reformulate the approximation error as the suprema of unbounded empirical processes, for which bounds in terms of bracketing entropies at multiple scales can be obtained.

2. Then, we bound the bracketing entropies via Lipschitz continuity.

3. Finally, the scale is optimized.

Step 1. It follows from (17) that,

 ∥gz∥ :=∥gz∥L2(Rd,Λ)=√Λg2z≤√∫Rd∣∣ωp+q∣∣2dΛ(ω)=:σp,q.

Define so that

 ∥fz∥≤1∀z∈SΔ⇒supf∈F∥f∥≤1, (18)

where . The target quantity can be rewritten in supremum of empirical process form as

 supx,y∈S∣∣∂p,qk(x,y)−ˆ∂p,qk(x,y)∣∣=supz∈SΔ|Λgz−Λmgz| =σp,qsupf∈F|(Λ−Λm)f|=:σp,q∥Λ−Λm∥F.

By the Bernstein condition [(15)] the following uniform bound holds:

 supfz:z∈SΔΛ|fz|n ≤∫Rd|ωp+q|n(σp,q)ndΛ(ω)≤n!2Kn−2(n=2,3,…). (19)

The uniform boundedness of [(18)] with its Bernstein property [(19)] imply by [34, Theorem 8] that for all and for all scale

 Λm({ω1:m:supf∈F∣∣√m(Λ−Λm)f∣∣≥minSES+36K√m+24√6[√t+Lmt2]})≤2e−t, (20)

where

 ES :=2−S√m+14S∑s=02−s√6Hs+36KH0√m, Lm :=√6K2√m,Hs:=ln(Ns+1), Ns :=N[](2−s,F,∥⋅∥), H0 =ln(N0+1),

and is the cardinality of the minimal generalized bracketing set of . Formally, , and for such that .

Step 2. We continue the proof by bounding the entropies and in (20). Using (15) for the envelope function , we get

 Λ(Fn) =Λ([supf∈F|f|]n)=Λ(supf∈F|f|n)≤∫Rd|ωp+q|n(σp,q)ndΛ(ω)≤n!2Kn−2,n=2,3,…

Hence also satisfies the weaker Bernstein condition. Consequently, one can choose [34, remark after Definition 8], and .

Next we bound (). The function class is Lipschitz continuous in the parameters ():

 |fz1(ω)−fz2(ω)| =∣∣ωp(−ω)qc|p+q|(ωTz1)−ωp(−ω)qc|p+q|(ωTz2)∣∣σp,q =|ωp+q|∣∣c|p+q|(ωTz1)−c|p+q|(ωTz2)∣∣σp,q\lx@stackrel(a)≤|ωp+q|σp,q∣∣