Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features

# Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features

## Abstract

Despite their success, kernel methods suffer from a massive computational cost in practice. In this paper, in lieu of commonly used kernel expansion with respect to inputs, we develop a novel optimal design maximizing the entropy among kernel features. This procedure results in a kernel expansion with respect to entropic optimal features (EOF), improving the data representation dramatically due to features dissimilarity. Under mild technical assumptions, our generalization bound shows that with only features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., ). The salient feature of our design is its sparsity that significantly reduces the time and space cost. Our numerical experiments on benchmark datasets verify the superiority of EOF over the state-of-the-art in kernel approximation.

## 1 Introduction

Kernel methods are powerful tools in describing nonlinear data models. However, despite their success in various machine learning tasks, kernel methods always suffer from scalability issues, especially when the learning task involves matrix inversion (e.g., kernel ridge regression). This is simply due to the fact that for a dataset of size , the inversion step requires time cost. To tackle this problem, a great deal of research has been dedicated to the approximation of kernels using low-rank surrogates [1, 2, 3]. By approximating the kernel, these methods deal with a linear problem, potentially solvable in a linear time with respect to (see e.g. [4] for linear Support Vector Machines (SVM)).

In the approximation of kernel with a finite number of features, one fundamental question is how to select the features. As an example, in supervised learning, we are interested to identify features that lead to low out-of-sample error. This question has been studied in the context of random features, which is an elegant method for kernel approximation [3]. Most of the works in this area improve the out-of-sample performance by modifying the stochastic oracle from which random features are sampled [5, 6, 7]. Nevertheless, these methods deal with dense feature matrices (due to randomness) and still require a large number of features to learn the data subspace. Decreasing the number of features directly affects the time and space costs, and to achieve that we must choose features that are as distinct as possible (to better span the space). Focusing on explicit features, we aim to achieve this goal in the current work.

### 1.1 Our Contributions

In this paper, we study low-rank kernel approximation by finding a set of mutually orthogonal features with nested and compact supports. We first theoretically characterize a condition (based on the Sturm-Liouville problem), which allows us to obtain such features. Then, we propose a novel optimal design method that maximizes the metric entropy among those features. The problem is formulated as a combinatorial optimization with a constraint on the number of features used for approximation. The optimization is generally NP-hard but yields closed-form solutions for specific numbers of features. The algorithm, dubbed entropic optimal features (EOF), can use these features for supervised learning. The construction properties of features (orthogonality, compact support, and nested support) result in a sparse approximation saving dramatically on time and space costs. We establish a generalization bound for EOF that shows with only features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., ). Our numerical experiments on benchmark datasets verify the superiority of EOF over the state-of-the-art in kernel approximation. While we postpone the exhaustive literature review to Section 6, none of the previous works has approached the problem from the entropy maximization perspective, which is the unique distinction of the current work.

## 2 Preliminaries on Kernel Methods

Kernel methods map finite-dimensional data to a potentially infinite dimensional feature space. Any element in the reproducing kernel Hilbert space (RKHS) of , denoted by , has the following representation:

 f=∞∑i=1⟨f,gi⟩kgi, (1)

where RKHS inner product induced by and is any feature set (i.e., orthonormal basis) that spans the space . In general, the kernel trick relies on the observation that the inner product with (reproducing property), so is cheap to compute without the need to calculate the inner product. In this case, the feature set selected in equation (7) is and the target function can be written as .

Under mild conditions, by the Representer Theorem, it is guaranteed that any solution of the risk minimization problem assumes the form , where is the number of training data points. However, this representation introduces a massive time cost of and a memory cost of in the training. Further, the feature space may not cover from an optimal sense. To be more specific, there might be another set of features with such that where is the input data.

To address the aforementioned problem, [3] propose a random approximation of

 k(x,x′)≈zT(x)z(x′) (2)

where is a random vector. This decomposes the feature into a linear combination of random low-rank features to approximate the original target function by . This idea resolves the computational issue of the algorithm, but due to random selection of the features, the method does not offer the best candidate features for reconstructing the target function.

Furthermore, in supervised learning the goal is to find a mapping from inputs to outputs, and an optimal kernel approximation does not necessarily result in an optimal target function. The reason is simply that we require the features that best represent the underlying data model (or target function) rather than the kernel function.

## 3 Kernel Feature Selection

In this paper, we propose an algorithm that uses a sparse representation to attain a high prediction accuracy with a low computational cost. The key ingredient is to find an expansion:

 f=∞∑i=1⟨f,gi⟩kgi (3)

such that features satisfy the following properties:

1. Compact support: supt is compact.

2. Nested support: supt for some finite set I.

3. Orthogonality: where denotes the Kronecker delta.

Properties 1-2 ensure low time cost for the algorithm by promoting sparsity. To be more specific, given any finite set and any data point , for a large number of . Property 3 provides a better expansion of .

In general, this problem may be intractable; however, we will prove later in Theorem 2 that when satisfies the following condition, then a feature set that satisfies properties 1-3 does exist:

###### Condition 1.

Let kernel be of the following product form:

 k(x,x′)=D∏d=1p(min{xd,x′d})q(max{xd,x′d})

where and are the independent solutions of the Sturm-Liouville problem on the interval for any :

 ddxα(x)dydx+β(x)y=0,

and they satisfy the following boundary conditions:

 c11p′(a)+c12p(a)=0 c21q′(b)+c22q(b)=0

with for and the operator is an elliptic operator that satisfies Lax-Milgram Theorem (see section 6 of [8]).

We provide two commonly used kernels that satisfy condition 1:

 k(x,x′)=D∏d=1[ωmin{xd,x′d}+1].

The first one is the Laplace kernel and the second one is the kernel associated to weighted Sobolev space [9]. Let for any . Then, when the dimension , features associated to Laplace kernel satisfying properties 1-3 are as follows:

 ϕl,i(x)=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩sinhω|x−zl,i+1|sinhω2−lif x∈(zl,i,zl,i+1]sinhω|x−zl,i−1|sinhω2−lif x∈[zl,i−1,zl,i]0  otherwise (4)

and features associated to the weighted Sobolev space kernel are as follows:

 ϕl,i(x)=max{0,1−|x−zl,i|2−l}

where is the index of features. We now start from 1-D kernel to construct a feature space that satisfies properties 1-3:

###### Theorem 1.

Suppose is a kernel that satisfies Condition 1. Let and let . We then define the following function on the interval :

 ϕl,i(x)=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩q(x)pl,i+1−p(x)ql,i+1ql,ipl,i+1−pl,iql,i+1ifx∈(zl,i, zl,i+1]p(x)ql,i−1−q(x)pl,i−1pl,iql,i−1−ql,ipl,i−1ifx∈[zl,i−1, zl,i]0 otherwise. (5)

where and . Then the following feature set is an orthogonal basis of the RKHS of , , that satisfies property 1-3 on the unit interval :

 {ϕl,i:l∈N,i∈Bl}.

The theorem above characterizes the set of features that satisfy Condition 1 when the input is scalar. To extend the idea to -dimensional space, we only need to take the tensor product form of the 1-dimensional kernel, as described by the consequent theorem:

###### Theorem 2.

Suppose is a kernel that satisfies Condition 1. For any , we define the Cartesian product of sets as follows:

 Zl=×Dd=1Zld={zl,i=(zl1,i1,⋯,zlD,iD):zld,id∈Zld} Bl=×Dd=1={i∈ND:id∈Bld}.

We then define the following function on the hypercube :

 ϕl,i(x)=D∏d=1ϕld,id(xd) (6)

where the function is defined in Theorem 1. Then the following feature set is an orthogonal basis of the RKHS of , , that satisfies property 1-3 on the unit cube :

 {ϕl,i:l∈ND,i∈Bl}.

The proof of Theorem 1 is given in the supplementary material. Theorem 2 can be derived from Theorem 1, because the kernel is simply the tensor product of 1-dimensional kernel in Theorem 1.

###### Corollary 3.

For any kernel satisfies condition 1 and let be the function defined in Theorem 2. Then we have the following expansion for :

 k(x,x′)=∑l∈ND∑i∈Blϕl,i(x)ϕl,i(x′)⟨ϕl,i,ϕl,i⟩k (7)

where is the inner product induced by .

###### Proof.

We only need to substitute in equation (3) by , then according to the reproducing property of we can have the result. ∎

Corollary 3 is the direct result of Theorem 2. So we can have the following sparse approximation for the value :

 k(x,x′)≈zT(x)z(x′)

where

 z(x)=[ϕl,i(x)||ϕl,i||k](l,i)∈S

for some set . We will show in section 4.1 that most entries on are zero. Form this perspective, the expansion (7) is analogous to the random feature (2) except that the above is nonrandom.

We now use the RKHS of the following kernel on as an example:

 k(x,x′)=min{x,x′}[1−max{x,x′}].

The RKHS associated to is the first order Sobolev space with zero boundary conditions:

 Hk={f:∫10[f′(s)]2ds<∞,f(0)=f(1)=0}.

In this example, the feature functions given by Theorem 1 coincide with a wavelet basis in . Consider the mother wavelet given by the triangular function:

 ϕ(d)=max{0,1−|d|}.

Then for any , , direct calculations show that

 ϕl,i(x)=ϕ(x−i2−l2−l). (8)

Now it is easy to verify that the features satisfy the desired properties 1-3:

1. supt.

2. supt

Figure 1 illustrates the compact and nested supports of these wavelet features.

The compact support properties can lead to a significant improvement in time cost. Consider the evaluation of . The compact support property implies that for most ’s, so that the computational cost of evaluating can be much lower than the total number of features. In Section 4.1, we will leverage this property of the basis functions to propose an efficient algorithm for learning. This goal cannot be achieve when the basis functions are not compactly supported, such as the random features.

Figure 2 shows the example of the tensor product of the wavelet feature defined in (8). It is a 2-dimensional extension of the wavelet feature and according to Theorem 2, the features satisfy properties 1-3 in the RKHS induced by the following kernel:

 k(x,x′)=D∏d=1min{xd,x′d}[1−max{xd,x′d}],

which is the mixed Sobolev space of first order with zero boundary condition on .We refer the reader to [10] for more details on mixed order Sobolev space.

In view of Theorem 2, we can now lift a data point from to a finite dimensional space spanned by features with compact and nested supports. As a result, the evaluation of on a large number of features is zero, yielding a sparse and efficient representation.

## 4 Entropic Optimal Design

In the previous section, we provide conditions under which we can find features with compact and nested supports. We now present an optimization criterion to select the best finite set of features with the maximum metric entropy. The intuition behind this choice is that we favor a set of features that are different from each other as much as possible, so that we can reconstruct the underlying model by a moderate amount of features.

To formulate the optimization problem, we need to introduce some notation. First we introduce the covering number of an operator between two Banach spaces. Let and be Banach spaces with unit balls and , respectively. The covering number of an operator is defined as

 N(T,ε)=infn∈N{n:∃{bi∈B}ni=1 s.t. T(BA)⊆n⋃i=1(bi+εBB)}.

The metric entropy of is then defined as . Now, let be the RKHS associated to kernel with the inner product , and let be the projection operator from to the following finite dimensional subspace

 FS={ϕl,i:(l,i)∈S},

where is defined in Theorem 2 and . Our goal is to find the optimal set (with cardinality at most ), whose corresponding feature set maximizes the entropy. This is equivalent to solving the following optimization problem:

 supSEnt[PS,ε] s.t.|S|≤M. (9)

Following the lines in the proof of Theorem 2 (in the supplementary), we can show that the features in are mutually orthogonal with Hilbert norm:

 ||ϕl,i||2Hk=:C−1l,i, (10)

where as (see lemma 1 in Supplementary Material). We first multiply by to normalize the feature. For any function , we then have

 PSf=∑(l,i)∈SCl,i⟨f,ϕl,i⟩kϕl,i.

As a result, the entropic optimization problem (9) is equivalent to searching an -dimensional Euclidean space with the largest unit ball, which can be characterized as follows

 maxS∑(l,i)∈SCl,i s.t.|S|≤M.

This optimization problem is called the Knapsack problem and, in general, is NP-hard [11]. However, for some specific values of , closed form solutions exist. Consider the Laplace kernel here as an example. For Laplace kernel , from direct calculation, the constant is:

 Cl,i=D∏d=1sinh(ω2−ld).

In this case, is independent of and for any , the value . Therefore, we can derive that when for some , the optimal set is

 S∗n={(l,i):|l|≤n,i∈Bl} (11)

because for any and any , . It turns out the set is equivalent to the Sparse Grid design [10].

### 4.1 Algorithm: Entropic Optimal Features

With the aforementioned theorems, we can now describe the algorithm to compute the regression function associated to a kernel that satisfies Condition 1. Suppose the set given by equation (11) is the index set associated to the feature functions that maximizes the entropy optimization problem (9). So given a specific input , we aim to compute the vector

 z(x)=[Cl,iϕl,i(x)](l,i)∈S∗n=:[zl,i(x)](l,i)∈S∗n

where is the coeffecient defined in (10), is the approximation that satisfies

 k(x,x′)≈z(x)Tz(x′)

in Corollary 3 with the feature function defined in equation (6). We call the entropic optimal feature (EOF).

According to properties 1-3, the supports of are either disjoint or nested. Therefore, only a small amount of entries on are non-zero. To be more specific, given any and input , the supports of are disjoint so we can immediately compute the unique non-zero entry . Algorithm 1 shows how to explicitly compute the EOF at a data point . Note that , denote the ceiling and floor operations, respectively.

The dimension of the vector given levels is [10]. The number of non-zero elements for after running Algorithm 1 is:

 ∑|l|≤n+D−11 =n+D−1∑i=D∑|l|=i1 =n+D−1∑i=D(i−1D−1) =(n+D−1D)=O(nD),

which means fraction of non-zeros to the whole vector in grows with as a function of level .

Time Complexity of EOF in Regression: Based on above, if we fix as the size of , the number of non-zero entries on is . Since we evaluate for each training data, the feature matrix has non-zero elements, resulting in a training cost of , which is smaller than of random features [3], especially when is moderate.

## 5 Generalization Bound

In this section, we present the generalization bound for EOF when it is used in supervised learning. Let us define the approximated target function as

 ^f:=argminf∈FM1NN∑j=1L(yi,f(xi))+λ∥f∥2k,

given independent and identically distributed samples , where denotes the space spanned by the first EOFs; is a loss function; and is a tuning parameter that may depend on . We denote by the true risk. The goal is to bound the generalization error .

We use the following assumptions to establish the bound:

###### Assumption 1.

There exists so that .

###### Assumption 2.

The function is twice differentiable for all . Furthermore, is strongly convex.

###### Assumption 3.

The density function of input is uniformly bounded away from infinity. The outputs are uniformly bounded.

Assumption 1 allows infimum to be achieved in the RKHS. This is not ensured automatically since we deal with a potentially infinite-dimensional RKHS , that is possibly universal (see Remark 2 of [12]). Assumption 2 is true for common loss functions including least squares for regression () and logistic regression for classification (). The bounded output constraint of Assumption 3 is also common in supervised learning.

The generalization bound is given by the following theorem.

###### Theorem 4.

Suppose Assumptions 1-3 are fulfilled. If the tuning parameter is choosing to have , then

 R(^f)−inffR(f)≤Op(N−1/2)+CM−2log4D−4M,

for some . The constants may depend on .

The theorem above shows that with EOFs, the optimal statistical accuracy is achieved up to logarithmic factors. Compared to random features for kernel approximation, this result improves the generalization bound. For random features, the number of required features to achieve the optimal rate is in the case of ridge regression [12].

## 6 Related Literature

We provide related works for kernel approximation from different perspectives:

Random Features (Randomized Kernel Approximation): Randomized features was introduced as an elegant approach for Monte Carlo approximation of shift-invariant kernels [3], and it was later extended for Quasi Monte Carlo approximation [13]. Several methods consider improving the time cost of random features, decreasing it by a linear factor of the input dimension (see e.g., Fast-food [14, 15]). Quadrature-based random features are also shown to boost kernel approximation [16]. The generalization properties of random features have been studied for -regularized risk minimization [17] and ridge regression [12], improving the initial generalization bound of [18]. [19] develop orthogonal random features (ORF) to boost the variance of kernel approximation. ORF is shown to provide optimal kernel estimator in terms of mean-squared error [20]. A number of recent works have considered data-dependent sampling of random features to improve kernel approximation. Examples consist of [21] on compact nonlinear feature maps, [15, 22] on approximation of shift-invariant/translation-invariant kernels, and [23] on data-dependent approximation using greedy approaches (e.g., Frank-Wolfe). Furthermore, data-dependent sampling has been used to improve generalization in supervised learning [5, 7] through target kernel alignment.

Deterministic Kernel Approximation: The studies on finding low-rank surrogates for kernels date back two decades [1, 2]. As an example, the celebrated Nyström method [24, 25] samples a subset of training data for approximating a low-rank kernel matrix. The Nyström method has been further improved in [26] and more recently used for approximation of indefinite kernels [27]. Explicit feature maps have also proved to provide efficient kernel approximation. The works of [28, 29, 30] have proposed low-dimensional Taylor expansions of Gaussian kernel for improving the time cost of learning. [31] further study explicit feature maps for additive homogeneous kernels.

Sparse Approximation Using Greedy Methods: Sparse approximation literature has mostly focused on greedy methods. [32] have developed a matching pursuit algorithm where kernels are the dictionary elements. The work of [33] focuses on sparse regression and classification models using Mercer kernels, and [34] considers sparse regression with multiple kernels. Classical matching pursuit was developed for regression, but further extensions to logistic regression [35] and smooth loss functions [36] have also been studied. [37] propose a greedy reconstruction technique for regression by empirically fitting squared error residuals. [38] also use greedy methods for sparse approximation using multiple kernels.

Our approach is radically different from the prior work in the sense that we characterize a set of features that maximize the entropy. Our feature construction and entropy optimization techniques are novel and have not been explored in the kernel approximation literature.

## 7 Numerical Experiments

Benchmark Algorithm: We now compare EOF with the following random-feature benchmark algorithms on several datasets from the UCI Machine Learning Repository:

1) RKS [18] with approximated Laplace kernel feature , where are sampled from a Cauchy distribution multiplied by , and are sampled from the uniform distribution on .
2) ORF [19] with approximated Gaussian kernel feature , with where is a diagonal matrix, with diagonal entries sampled i.i.d. from the -distribution with degrees and is the orthogonal matrix obtained from the QR decomposition of a matrix with normally distributed entries. Note that ORF approximates a Gaussian kernel.
3) LKRF [5] with approximated Laplace kernel feature , with first a larger number random features are sampled and then re-weighted by solving a kernel alignment optimization. The top random features would be used in the training.
4) EERF [7], with approximated Laplace kernel feature where first a larger number random features are sampled and then re-weighted according to a score function. The top random features would appear in the training.

Experiment Setup: We also use approximated Laplace kernel feature where with defined as equation (4). To determine the value of used in RKS, EERF, LKRF and ORF we choose the value of for each dataset to be the mean distance of the nearest neighbor [19]. We then calculate the corresponding for EOF associated to . The number of features in EOF is a function of dimension and level , so it is not possible to calculate them for any . To resolve this issue, for any given , we select the set defined in (11) that satisfies

 ∣∣S∗n−1∣∣

and randomly select pairs of to have a random set . We then use the following approximated feature:

 zM(x):=[ϕl,i(x)](l,i)∈SM.

This is equivalent to randomly select rows from the feature .

We let for LKRF and EERF, then for any , we compare the performance of different algorithms.

Datasets: In Table 1, we report the number of training samples and test samples used for each dataset. For the MNIST data set, we map the original dimensional data to a dimensional space using an auto-encoder. If the training and test samples are not provided separately for a dataset, we split it randomly. We standardize the data as follows: we scale each input to the unit interval and the responses in regression to be inside .

Comparison: For a fixed number of features, we perform 50 simulation runs for each algorithm on each data set. We then report the average test error (with standard errors) in Fig. 3 where the plot line is the mean error of an algorithm and the error bar reflects the standard deviation of the error. Throughout our experiments, we can see that EOF consistently improves the test error compared to other randomized-feature algorithms. This is specifically visible when the gap between and becomes very small and, due to the optimality of , EOF outperforms any random feature algorithm.

In Table 2, we also compare the time complexity and space complexity. We define the feature matrix

 F:=[z(xi)]Ni=1,

which is an matrix with the number of features and the number of data. Due to the sparse structure of EOF, we can also see that the number of non-zero entries of the associated to EOF is smaller than other methods. When both the dimension and the size of data are large, the sparsity of EOF becomes more obvious as shown in the case of MNIST. The time cost of running EOF is also quite impressive. It is consistently better than EERF and LKRF and comparable and slightly slower than RKS. In fact, the major time for EOF is spent on feature matrix construction. For random features, due to high efficiency of matrix operations in Matlab, feature construction is fast. However, for EOF the feature construction via matrix operations is not possible in an efficient way. We observed that after the feature matrix construction, EOF is the fastest method in training. For example, if we only count the training time (excluding feature construction) as the time cost, in kernel ridge regression on the dataset Superconductivity, the comparison between RKS and EOF is as follows:

The run time is obtained on a MacPro with a 4-core, 3.3 GHz Intel Core i5 CPU and 8 GB of RAM (2133Mhz).

## 8 Conclusion

We provide a method to construct a set of mutually orthogonal features (with nested and small supports) and select the best of them that maximize the entropy of the associated projector. The nested and compact support of feature functions greatly reduces the time and space cost for feature matrix operations. The orthogonality and entropic optimality reduces dramatically the error of approximation. We have provided generalization error bound which indicates that only features are needed to achieve the optimal accuracy. Future directions include generalizing this method to a broader class of kernels.

## Appendix A Proof of Theorem 1

The kernel function:

 k(x,y)=p(min{x,y})q(max{x,y})

is in fact the Green’s function of the Sturm-Liouville operator [39]

 L:=ddxα(x)ddx+β(x)

So the inner product product induced by is

 ⟨f,g⟩k=∫10fLgdx

For any and , the supports of and are and respectively. This two supports are disjoint because both and are odd so if . For any and any , the supports supt and supt are either disjoint or nested. If they are disjoint, then . If they are nested, , without loss of generality assume and , then because both and satisfy:

 Lp=Lq=0

so

 ⟨ϕl,i,ϕn,j⟩k =∫(i+1)2−l(i−1)2−lϕn,jLϕl,idx =∫(i+1)2−l(i−1)2−lϕn,jLp(x)ql,i−1−q(x)pi,l−1pl,iql,i−1−ql,ipl,i−1dx =0.

As a result, we have

 ⟨ϕl,i,ϕn,j⟩k=λl,iδ(l,i),(n,j)

where is a function of and .

## Appendix B Proof of Theorem 4

We need the following lemmas.

###### Lemma 5.

Denote . Then we have

 R(fM)−R(f0)≤CM−2log4D−4M∥f0∥2k,

for some constant .

###### Proof.

According to Assumption 2, we can see that

 R(fM)−R(f0)=E[m′′y(u∗)(fM(x)−f0(x))2]

In view of Assumption 3, we only need to prove

 ∥fM−f0∥2L2=CM−2log4D−4M∥f0∥2k

for any we then can finish the proof. Let . According to theorem 2, we have the following expansion:

 ∥fM−f0∥L2 =∥∑|l|>n∑i∈Bl⟨f0,ϕl,i∥ϕl,i∥k⟩kϕl,i(⋅)∥ϕl,i∥k∥L2 =∥∑|l|>n∑i∈Bi∫Sl,if0(s)Lϕl,i(s)dsϕl,i(⋅)∥ϕl,i∥2k∥L2.

where is the support of . We let

 v(⋅)l:=∑i∈Bi∫Sl,if0(s)Lϕl,i(s)dsϕl,i(⋅)∥ϕl,i∥2k.

Our first goal is to estimate . From theorem 2 of [40] or direct calculation based on the property of Green’s function, we can see that for any :

 ∫Sl,if(s)Lϕl,i(s)ds=[D⨂d=1Δld,id]f

where

 Δld,idf:=αld,idf∣∣xd=zld,id −βld,id−1f∣∣xd=zld,id−1−βld,id+1f∣∣xd=zld,id