Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features
Abstract
Despite their success, kernel methods suffer from a massive computational cost in practice. In this paper, in lieu of commonly used kernel expansion with respect to inputs, we develop a novel optimal design maximizing the entropy among kernel features. This procedure results in a kernel expansion with respect to entropic optimal features (EOF), improving the data representation dramatically due to features dissimilarity. Under mild technical assumptions, our generalization bound shows that with only features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., ). The salient feature of our design is its sparsity that significantly reduces the time and space cost. Our numerical experiments on benchmark datasets verify the superiority of EOF over the stateoftheart in kernel approximation.
1 Introduction
Kernel methods are powerful tools in describing nonlinear data models. However, despite their success in various machine learning tasks, kernel methods always suffer from scalability issues, especially when the learning task involves matrix inversion (e.g., kernel ridge regression). This is simply due to the fact that for a dataset of size , the inversion step requires time cost. To tackle this problem, a great deal of research has been dedicated to the approximation of kernels using lowrank surrogates [1, 2, 3]. By approximating the kernel, these methods deal with a linear problem, potentially solvable in a linear time with respect to (see e.g. [4] for linear Support Vector Machines (SVM)).
In the approximation of kernel with a finite number of features, one fundamental question is how to select the features. As an example, in supervised learning, we are interested to identify features that lead to low outofsample error. This question has been studied in the context of random features, which is an elegant method for kernel approximation [3]. Most of the works in this area improve the outofsample performance by modifying the stochastic oracle from which random features are sampled [5, 6, 7]. Nevertheless, these methods deal with dense feature matrices (due to randomness) and still require a large number of features to learn the data subspace. Decreasing the number of features directly affects the time and space costs, and to achieve that we must choose features that are as distinct as possible (to better span the space). Focusing on explicit features, we aim to achieve this goal in the current work.
1.1 Our Contributions
In this paper, we study lowrank kernel approximation by finding a set of mutually orthogonal features with nested and compact supports. We first theoretically characterize a condition (based on the SturmLiouville problem), which allows us to obtain such features. Then, we propose a novel optimal design method that maximizes the metric entropy among those features. The problem is formulated as a combinatorial optimization with a constraint on the number of features used for approximation. The optimization is generally NPhard but yields closedform solutions for specific numbers of features. The algorithm, dubbed entropic optimal features (EOF), can use these features for supervised learning. The construction properties of features (orthogonality, compact support, and nested support) result in a sparse approximation saving dramatically on time and space costs. We establish a generalization bound for EOF that shows with only features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., ). Our numerical experiments on benchmark datasets verify the superiority of EOF over the stateoftheart in kernel approximation. While we postpone the exhaustive literature review to Section 6, none of the previous works has approached the problem from the entropy maximization perspective, which is the unique distinction of the current work.
2 Preliminaries on Kernel Methods
Kernel methods map finitedimensional data to a potentially infinite dimensional feature space. Any element in the reproducing kernel Hilbert space (RKHS) of , denoted by , has the following representation:
(1) 
where RKHS inner product induced by and is any feature set (i.e., orthonormal basis) that spans the space . In general, the kernel trick relies on the observation that the inner product with (reproducing property), so is cheap to compute without the need to calculate the inner product. In this case, the feature set selected in equation (7) is and the target function can be written as .
Under mild conditions, by the Representer Theorem, it is guaranteed that any solution of the risk minimization problem assumes the form , where is the number of training data points. However, this representation introduces a massive time cost of and a memory cost of in the training. Further, the feature space may not cover from an optimal sense. To be more specific, there might be another set of features with such that where is the input data.
To address the aforementioned problem, [3] propose a random approximation of
(2) 
where is a random vector. This decomposes the feature into a linear combination of random lowrank features to approximate the original target function by . This idea resolves the computational issue of the algorithm, but due to random selection of the features, the method does not offer the best candidate features for reconstructing the target function.
Furthermore, in supervised learning the goal is to find a mapping from inputs to outputs, and an optimal kernel approximation does not necessarily result in an optimal target function. The reason is simply that we require the features that best represent the underlying data model (or target function) rather than the kernel function.
3 Kernel Feature Selection
In this paper, we propose an algorithm that uses a sparse representation to attain a high prediction accuracy with a low computational cost. The key ingredient is to find an expansion:
(3) 
such that features satisfy the following properties:

Compact support: supt is compact.

Nested support: supt for some finite set I.

Orthogonality: where denotes the Kronecker delta.
Properties 12 ensure low time cost for the algorithm by promoting sparsity. To be more specific, given any finite set and any data point , for a large number of . Property 3 provides a better expansion of .
In general, this problem may be intractable; however, we will prove later in Theorem 2 that when satisfies the following condition, then a feature set that satisfies properties 13 does exist:
Condition 1.
Let kernel be of the following product form:
where and are the independent solutions of the SturmLiouville problem on the interval for any :
and they satisfy the following boundary conditions:
with for and the operator is an elliptic operator that satisfies LaxMilgram Theorem (see section 6 of [8]).
We provide two commonly used kernels that satisfy condition 1:
The first one is the Laplace kernel and the second one is the kernel associated to weighted Sobolev space [9]. Let for any . Then, when the dimension , features associated to Laplace kernel satisfying properties 13 are as follows:
(4) 
and features associated to the weighted Sobolev space kernel are as follows:
where is the index of features. We now start from 1D kernel to construct a feature space that satisfies properties 13:
Theorem 1.
Suppose is a kernel that satisfies Condition 1. Let and let . We then define the following function on the interval :
(5) 
where and . Then the following feature set is an orthogonal basis of the RKHS of , , that satisfies property 13 on the unit interval :
The theorem above characterizes the set of features that satisfy Condition 1 when the input is scalar. To extend the idea to dimensional space, we only need to take the tensor product form of the 1dimensional kernel, as described by the consequent theorem:
Theorem 2.
Suppose is a kernel that satisfies Condition 1. For any , we define the Cartesian product of sets as follows:
We then define the following function on the hypercube :
(6) 
where the function is defined in Theorem 1. Then the following feature set is an orthogonal basis of the RKHS of , , that satisfies property 13 on the unit cube :
The proof of Theorem 1 is given in the supplementary material. Theorem 2 can be derived from Theorem 1, because the kernel is simply the tensor product of 1dimensional kernel in Theorem 1.
Corollary 3.
Proof.
We only need to substitute in equation (3) by , then according to the reproducing property of we can have the result. ∎
Corollary 3 is the direct result of Theorem 2. So we can have the following sparse approximation for the value :
where
for some set . We will show in section 4.1 that most entries on are zero. Form this perspective, the expansion (7) is analogous to the random feature (2) except that the above is nonrandom.
We now use the RKHS of the following kernel on as an example:
The RKHS associated to is the first order Sobolev space with zero boundary conditions:
In this example, the feature functions given by Theorem 1 coincide with a wavelet basis in . Consider the mother wavelet given by the triangular function:
Then for any , , direct calculations show that
(8) 
Now it is easy to verify that the features satisfy the desired properties 13:

supt.

supt

Figure 1 illustrates the compact and nested supports of these wavelet features.
The compact support properties can lead to a significant improvement in time cost. Consider the evaluation of . The compact support property implies that for most ’s, so that the computational cost of evaluating can be much lower than the total number of features. In Section 4.1, we will leverage this property of the basis functions to propose an efficient algorithm for learning. This goal cannot be achieve when the basis functions are not compactly supported, such as the random features.
Figure 2 shows the example of the tensor product of the wavelet feature defined in (8). It is a 2dimensional extension of the wavelet feature and according to Theorem 2, the features satisfy properties 13 in the RKHS induced by the following kernel:
which is the mixed Sobolev space of first order with zero boundary condition on .We refer the reader to [10] for more details on mixed order Sobolev space.
In view of Theorem 2, we can now lift a data point from to a finite dimensional space spanned by features with compact and nested supports. As a result, the evaluation of on a large number of features is zero, yielding a sparse and efficient representation.
4 Entropic Optimal Design
In the previous section, we provide conditions under which we can find features with compact and nested supports. We now present an optimization criterion to select the best finite set of features with the maximum metric entropy. The intuition behind this choice is that we favor a set of features that are different from each other as much as possible, so that we can reconstruct the underlying model by a moderate amount of features.
To formulate the optimization problem, we need to introduce some notation. First we introduce the covering number of an operator between two Banach spaces. Let and be Banach spaces with unit balls and , respectively. The covering number of an operator is defined as
The metric entropy of is then defined as . Now, let be the RKHS associated to kernel with the inner product , and let be the projection operator from to the following finite dimensional subspace
where is defined in Theorem 2 and . Our goal is to find the optimal set (with cardinality at most ), whose corresponding feature set maximizes the entropy. This is equivalent to solving the following optimization problem:
(9) 
Following the lines in the proof of Theorem 2 (in the supplementary), we can show that the features in are mutually orthogonal with Hilbert norm:
(10) 
where as (see lemma 1 in Supplementary Material). We first multiply by to normalize the feature. For any function , we then have
As a result, the entropic optimization problem (9) is equivalent to searching an dimensional Euclidean space with the largest unit ball, which can be characterized as follows
This optimization problem is called the Knapsack problem and, in general, is NPhard [11]. However, for some specific values of , closed form solutions exist. Consider the Laplace kernel here as an example. For Laplace kernel , from direct calculation, the constant is:
In this case, is independent of and for any , the value . Therefore, we can derive that when for some , the optimal set is
(11) 
because for any and any , . It turns out the set is equivalent to the Sparse Grid design [10].
4.1 Algorithm: Entropic Optimal Features
With the aforementioned theorems, we can now describe the algorithm to compute the regression function associated to a kernel that satisfies Condition 1. Suppose the set given by equation (11) is the index set associated to the feature functions that maximizes the entropy optimization problem (9). So given a specific input , we aim to compute the vector
where is the coeffecient defined in (10), is the approximation that satisfies
in Corollary 3 with the feature function defined in equation (6). We call the entropic optimal feature (EOF).
According to properties 13, the supports of are either disjoint or nested. Therefore, only a small amount of entries on are nonzero. To be more specific, given any and input , the supports of are disjoint so we can immediately compute the unique nonzero entry . Algorithm 1 shows how to explicitly compute the EOF at a data point . Note that , denote the ceiling and floor operations, respectively.
The dimension of the vector given levels is [10]. The number of nonzero elements for after running Algorithm 1 is:
which means fraction of nonzeros to the whole vector in grows with as a function of level .
Time Complexity of EOF in Regression: Based on above, if we fix as the size of , the number of nonzero entries on is . Since we evaluate for each training data, the feature matrix has nonzero elements, resulting in a training cost of , which is smaller than of random features [3], especially when is moderate.
5 Generalization Bound
In this section, we present the generalization bound for EOF when it is used in supervised learning. Let us define the approximated target function as
given independent and identically distributed samples , where denotes the space spanned by the first EOFs; is a loss function; and is a tuning parameter that may depend on . We denote by the true risk. The goal is to bound the generalization error .
We use the following assumptions to establish the bound:
Assumption 1.
There exists so that .
Assumption 2.
The function is twice differentiable for all . Furthermore, is strongly convex.
Assumption 3.
The density function of input is uniformly bounded away from infinity. The outputs are uniformly bounded.
Assumption 1 allows infimum to be achieved in the RKHS. This is not ensured automatically since we deal with a potentially infinitedimensional RKHS , that is possibly universal (see Remark 2 of [12]). Assumption 2 is true for common loss functions including least squares for regression () and logistic regression for classification (). The bounded output constraint of Assumption 3 is also common in supervised learning.
The generalization bound is given by the following theorem.
Theorem 4.
The theorem above shows that with EOFs, the optimal statistical accuracy is achieved up to logarithmic factors. Compared to random features for kernel approximation, this result improves the generalization bound. For random features, the number of required features to achieve the optimal rate is in the case of ridge regression [12].
6 Related Literature
We provide related works for kernel approximation from different perspectives:
Random Features (Randomized Kernel Approximation): Randomized features was introduced as an elegant approach for Monte Carlo approximation of shiftinvariant kernels [3], and it was later extended for Quasi Monte Carlo approximation [13]. Several methods consider improving the time cost of random features, decreasing it by a linear factor of the input dimension (see e.g., Fastfood [14, 15]). Quadraturebased random features are also shown to boost kernel approximation [16]. The generalization properties of random features have been studied for regularized risk minimization [17] and ridge regression [12], improving the initial generalization bound of [18]. [19] develop orthogonal random features (ORF) to boost the variance of kernel approximation. ORF is shown to provide optimal kernel estimator in terms of meansquared error [20]. A number of recent works have considered datadependent sampling of random features to improve kernel approximation. Examples consist of [21] on compact nonlinear feature maps, [15, 22] on approximation of shiftinvariant/translationinvariant kernels, and [23] on datadependent approximation using greedy approaches (e.g., FrankWolfe). Furthermore, datadependent sampling has been used to improve generalization in supervised learning [5, 7] through target kernel alignment.
Deterministic Kernel Approximation: The studies on finding lowrank surrogates for kernels date back two decades [1, 2]. As an example, the celebrated Nyström method [24, 25] samples a subset of training data for approximating a lowrank kernel matrix. The Nyström method has been further improved in [26] and more recently used for approximation of indefinite kernels [27]. Explicit feature maps have also proved to provide efficient kernel approximation. The works of [28, 29, 30] have proposed lowdimensional Taylor expansions of Gaussian kernel for improving the time cost of learning. [31] further study explicit feature maps for additive homogeneous kernels.
Sparse Approximation Using Greedy Methods: Sparse approximation literature has mostly focused on greedy methods. [32] have developed a matching pursuit algorithm where kernels are the dictionary elements. The work of [33] focuses on sparse regression and classification models using Mercer kernels, and [34] considers sparse regression with multiple kernels. Classical matching pursuit was developed for regression, but further extensions to logistic regression [35] and smooth loss functions [36] have also been studied. [37] propose a greedy reconstruction technique for regression by empirically fitting squared error residuals. [38] also use greedy methods for sparse approximation using multiple kernels.
Our approach is radically different from the prior work in the sense that we characterize a set of features that maximize the entropy. Our feature construction and entropy optimization techniques are novel and have not been explored in the kernel approximation literature.
7 Numerical Experiments
Benchmark Algorithm: We now compare EOF with the following randomfeature benchmark algorithms on several datasets from the UCI Machine Learning Repository:
1) RKS [18] with approximated Laplace kernel feature , where are sampled from a Cauchy distribution multiplied by , and are sampled from the uniform distribution on .
2) ORF [19] with approximated Gaussian kernel feature , with where is a diagonal matrix, with diagonal entries sampled i.i.d. from the distribution with degrees and is the orthogonal matrix obtained from the QR decomposition of a matrix with normally distributed entries. Note that ORF approximates a Gaussian kernel.
3) LKRF [5] with approximated Laplace kernel feature , with first
a larger number random features are sampled and then reweighted by solving a kernel alignment
optimization. The top random features would be used in the training.
4) EERF [7], with approximated Laplace kernel feature where first
a larger number random features are sampled and then reweighted according to a score function. The top random features would appear in the training.
Experiment Setup: We also use approximated Laplace kernel feature where with defined as equation (4). To determine the value of used in RKS, EERF, LKRF and ORF we choose the value of for each dataset to be the mean distance of the nearest neighbor [19]. We then calculate the corresponding for EOF associated to . The number of features in EOF is a function of dimension and level , so it is not possible to calculate them for any . To resolve this issue, for any given , we select the set defined in (11) that satisfies
and randomly select pairs of to have a random set . We then use the following approximated feature:
This is equivalent to randomly select rows from the feature .
We let for LKRF and EERF, then for any , we compare the performance of different algorithms.
Datasets: In Table 1, we report the number of training samples and test samples used for each dataset. For the MNIST data set, we map the original dimensional data to a dimensional space using an autoencoder. If the training and test samples are not provided separately for a dataset, we split it randomly. We standardize the data as follows: we scale each input to the unit interval and the responses in regression to be inside .
Data set  Task  D  

MNIST  Classification  32  20000  10000 
Electrical Grids Stability  Classification  13  7000  3000 
Superconductivity  Regression  81  15000  6263 
Energy Efficiency  Regression  8  512  256 
Comparison: For a fixed number of features, we perform 50 simulation runs for each algorithm on each data set. We then report the average test error (with standard errors) in Fig. 3 where the plot line is the mean error of an algorithm and the error bar reflects the standard deviation of the error. Throughout our experiments, we can see that EOF consistently improves the test error compared to other randomizedfeature algorithms. This is specifically visible when the gap between and becomes very small and, due to the optimality of , EOF outperforms any random feature algorithm.
MNIST  

Method  nnz()  
RKS  80  1.64  
EERF  80  800  4.43  
LKRF  80  800  3.07  
ORF  80  1.21  
EOF  80  2048  2.45  
Superconductivity  
Method  nnz()  
RKS  160  0.10  
EERF  160  1600  0.45  
LKRF  160  1600  0.37  
ORF  160  0.13  
EOF  160  161  0.14 
Electrical Grids Stability  

Method  nnz()  
RKS  60  0.04  
EERF  60  600  0.14  
LKRF  60  600  0.13  
ORF  60  0.06  
EOF  60  338  0.08  
Energy Efficiency  
Method  nnz()  
RKS  60  0.01  
EERF  60  600  0.05  
LKRF  60  600  0.06  
ORF  60  0.02  
EOF  60  128  0.03 
In Table 2, we also compare the time complexity and space complexity. We define the feature matrix
which is an matrix with the number of features and the number of data. Due to the sparse structure of EOF, we can also see that the number of nonzero entries of the associated to EOF is smaller than other methods. When both the dimension and the size of data are large, the sparsity of EOF becomes more obvious as shown in the case of MNIST. The time cost of running EOF is also quite impressive. It is consistently better than EERF and LKRF and comparable and slightly slower than RKS. In fact, the major time for EOF is spent on feature matrix construction. For random features, due to high efficiency of matrix operations in Matlab, feature construction is fast. However, for EOF the feature construction via matrix operations is not possible in an efficient way. We observed that after the feature matrix construction, EOF is the fastest method in training. For example, if we only count the training time (excluding feature construction) as the time cost, in kernel ridge regression on the dataset Superconductivity, the comparison between RKS and EOF is as follows:
RKS  

EOF 
The run time is obtained on a MacPro with a 4core, 3.3 GHz Intel Core i5 CPU and 8 GB of RAM (2133Mhz).
8 Conclusion
We provide a method to construct a set of mutually orthogonal features (with nested and small supports) and select the best of them that maximize the entropy of the associated projector. The nested and compact support of feature functions greatly reduces the time and space cost for feature matrix operations. The orthogonality and entropic optimality reduces dramatically the error of approximation. We have provided generalization error bound which indicates that only features are needed to achieve the optimal accuracy. Future directions include generalizing this method to a broader class of kernels.
Supplementary Material
Appendix A Proof of Theorem 1
The kernel function:
is in fact the Green’s function of the SturmLiouville operator [39]
So the inner product product induced by is
For any and , the supports of and are and respectively. This two supports are disjoint because both and are odd so if . For any and any , the supports supt and supt are either disjoint or nested. If they are disjoint, then . If they are nested, , without loss of generality assume and , then because both and satisfy:
so
As a result, we have
where is a function of and .
Appendix B Proof of Theorem 4
We need the following lemmas.
Lemma 5.
Denote . Then we have
for some constant .
Proof.
According to Assumption 2, we can see that
In view of Assumption 3, we only need to prove
for any we then can finish the proof. Let . According to theorem 2, we have the following expansion:
where is the support of . We let
Our first goal is to estimate . From theorem 2 of [40] or direct calculation based on the property of Green’s function, we can see that for any :
where