# Nyström Subspace Learning for Large-scale SVMs

## Abstract

As an implementation of the Nyström method, Nyström computational regularization (NCR) imposed on kernel classification and kernel ridge regression has proven capable of achieving optimal bounds in the large-scale statistical learning setting, while enjoying much better time complexity. In this study, we propose a Nyström subspace learning (NSL) framework to reveal that all you need for employing the Nyström method, including NCR, upon any kernel SVM is to use the efficient off-the-shelf linear SVM solvers as a black box. Based on our analysis, the bounds developed for the Nyström method are linked to NSL, and the analytical difference between two distinct implementations of the Nyström method is clearly presented. Besides, NSL also leads to sharper theoretical results for the clustered Nyström method. Finally, both regression and classification tasks are performed to compare two implementations of the Nyström method.

## 1 Introduction

As theoretically well-developed statistical approaches to machine learning, kernel support vector machines (SVMs) have achieved success in a broad range of fields. However, a limitation arises when dealing with large-scale data, as the time complexity for obtaining an optimal solution is generally $O(n^3)$, where $n$ refers to the number of training samples. To address this issue, much effort has been devoted to developing efficient strategies for building large-scale kernel SVMs.

One popular and efficient way of achieving scalable kernel SVMs is the Nyström method, which was first introduced to the machine learning community by Williams and Seeger (2001). The main idea of the Nyström method is to select a set of $m$ ($m \ll n$) landmark points to provide a low-rank approximation of the full Gram matrix. Since then, enormous effort has been devoted to the Nyström method, bringing in randomized and deterministic algorithms, which have proven useful in applications where the full Gram matrices are replaced by well-approximated low-rank matrices (Kumar et al., 2012; Sun et al., 2015; Gittens and Mahoney, 2016).

To analyze and compare different strategies for selecting landmark points, three measurements have been considered: 1) the Gram matrix approximation (Drineas and Mahoney, 2005; Kumar et al., 2012), 2) the solution approximation (Cortes et al., 2010), and 3) the generalization error (Yang et al., 2012; Jin et al., 2013; Rudi et al., 2015). In the machine learning community, the generalization error is of primary interest. Specifically, Jin et al. (2013) and Rudi et al. (2015) proved that Nyström computational regularization (NCR), an implementation of the Nyström method, imposed on kernel classification and kernel ridge regression (KRR) is able to preserve optimal learning guarantees in the large-scale statistical learning setting. We also note that an equivalent form of NCR has already been studied by Yang et al. (2012), although that work aims to demonstrate the superiority of the Nyström method over random Fourier features (Rahimi and Recht, 2008). Nevertheless, the following two issues are not well addressed: 1) how NCR relates to its counterparts, namely the low-rank linearization approach (LLA) (Lan et al., 2019) and another implementation of the Nyström method (Williams and Seeger, 2001; Sun et al., 2015), which we call the standard Nyström; 2) how to apply the Nyström method, including NCR and the standard Nyström, to other forms of kernel SVMs without solving each approximate kernel SVM individually.

Inspired by the Nyström method and other linearization techniques (Rahimi and Recht, 2008; Chang et al., 2010), LLA was proposed to map the data from the endowed reproducing kernel Hilbert space into a low-dimensional Euclidean space, after which fast linear SVM solvers can be utilized. LLA has been successfully adopted by other practitioners (Golts and Elad, 2016). However, the lack of a well-developed theory has somewhat isolated LLA from NCR.

In this study, we start with Nyström subspace learning (NSL) that serves as an anchor to address the aforementioned issues, which is different from previous works that usually rely on matrix analysis or linear operators. The main idea of NSL is that it relates the Nyström method to kernel principal component analysis (KPCA), which is able to unravel the relationships among NCR, the standard Nyström, and LLA, and also provide sharper theoretical results for the clustered Nyström method (Zhang and Kwok, 2010).

The main contribution of this study is three-fold. First, with the aid of NSL, we prove that NCR ⊂ NSL + linear SVM learning = LLA. Notably, this conclusion indicates that even though NCR aims to regularize the training phase, it implicitly performs NSL over all data as the first step, which is closely related to KPCA. Second, NSL suggests a way to ease the application of the Nyström method, including NCR and the standard Nyström, to any kernel SVM by using off-the-shelf linear SVM solvers directly. This point will become clear once the relationships among NCR, the standard Nyström, and LLA are uncovered. Third, NSL provides sharper theoretical results for the clustered Nyström method. Specifically, our analysis serves as a complement to a related study (Oglic and Gärtner, 2017).

In what follows, related work is first introduced to cover the preliminaries. Then, the proposed Nyström subspace learning (NSL) is formulated as a cornerstone for delineating the connections between LLA, NCR and the standard Nyström. Afterwards, generalized theories for the clustered Nyström method are developed. Finally, we design experiments to compare NCR and the standard Nyström. Some lengthy proofs and additional empirical results are left to the Supplementary File.

## 2 Related Work

#### Notation

To be consistent, the bold letters are used for representing matrices or ordered sets of vectors in a Hilbert space (upper cases), and column vectors or vectors in a Hilbert space (lower cases), while the plain letters denote scalars or functions. Given a matrix , is the -th column of , refers to its pseudo-inverse, and is its -th element. Table 1 lists some further mathematical definitions herein. Note that , so the expression is without ambiguity. Besides, .

### 2.1 The Nyström Method

The Nyström method has become the most popular kernel matrix approximation method in the machine learning community since its introduction (Williams and Seeger, 2001). Suppose represents a dataset where and refer to the number of features and samples, respectively. Then, let be the corresponding Gram matrix with implicit feature map where is the unique real reproducing kernel Hilbert space coupled with an inner product operator and the resulting norm operator . Note that is symmetric positive semi-definite (SPSD). Denote by , and let or represent the set of landmark points. Note that the landmark points can be either selected from or . In the former case, . Denote by , and let be , and be . The Nyström method approximates the optimal rank- () approximation of with respect to unitarily invariant norm, e.g., Frobenius norm, trace norm or spectral norm, as

$$\mathbf{K} \approx \tilde{\mathbf{K}} = \mathbf{K}_{nm} \mathbf{W}_{(s)}^{\dagger} \mathbf{K}_{mn} \tag{1}$$

where is the optimal rank- approximation of with respect to unitarily invariant norm.

In general, most popular randomized Nyström methods can be summarized by a sketching matrix , which implicitly represents the selection of landmark points. From this angle, and . Different Nyström methods employ different sketching matrices. Obviously, the bounds for different measurements depend on how to construct the sketching matrix , and the size that serves to provide better rank- approximation. Regarding the sketching matrix , existing methods fall into two categories: 1) column selection, and 2) random projection. For column selection, is expressed as where is the sampling matrix that if the -th sample of is chosen in the -th independent random trial and otherwise, and serves as a diagonal rescaling matrix. Typical studies concerning column selection include uniform sampling, diagonal sampling (Drineas and Mahoney, 2005), and leverage score sampling (Drineas et al., 2012). For random projection, could be designed to implement Gaussian projection or subsampled randomized Hadamard transform (Gittens and Mahoney, 2016) such that is a random linear combinations of the columns of . By contrast, as a deterministic approach, the clustered Nyström method uses (kernel) k-means clustering centers as landmark points, i.e., or contains clustering centers (Zhang and Kwok, 2010; Oglic and Gärtner, 2017). We also notice that there are other types of variants (Kumar et al., 2009; Wang and Zhang, 2013; Si et al., 2016), but herein we focus on the studies based on constructing landmark points.

As an implementation of the Nyström method, the standard Nyström simply replaces the full Gram matrix with the approximate one without modifying the hypothesis in learning tasks. To efficiently apply the rank- approximation , it is further reformulated as where . Generally, the most computationally expensive steps in many methods such as Gaussian process regression and kernel SVMs are to calculate where , or . The former can be efficiently solved by utilizing the Woodbury formula

 (2)

whereas the latter can be efficiently calculated through

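$$\mathbf{A}\mathbf{A}^T = \mathbf{R}\boldsymbol{\Lambda}^{2}\mathbf{R}^T, \quad \mathbf{Y} = \mathbf{A}^T\mathbf{R}\boldsymbol{\Lambda}^{-1}, \quad \tilde{\mathbf{K}}^{\dagger} = \mathbf{Y}\boldsymbol{\Lambda}^{-2}\mathbf{Y}^T. \tag{3}$$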

Here, the first step is a compact singular value decomposition (SVD), i.e., zero singular values are excluded. However, plugging these procedures into each solver is less convenient than using off-the-shelf linear SVM solvers directly as a black box.
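The two shortcuts above can be checked numerically. The following is a minimal sketch, assuming a factorization $\tilde{\mathbf{K}} = \mathbf{A}^T\mathbf{A}$ with an $s \times n$ factor `A`; the random factor, the sizes, and the value `lam` are hypothetical stand-ins and not taken from the paper. It uses the standard Woodbury identity for a regularized inverse and the procedure (3) for the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, lam = 5, 40, 0.1            # hypothetical sizes and ridge parameter
A = rng.standard_normal((s, n))   # stand-in for the s x n factor, K_tilde = A^T A
K_tilde = A.T @ A

# Woodbury: (A^T A + lam I_n)^{-1} = (I_n - A^T (A A^T + lam I_s)^{-1} A) / lam,
# so only an s x s linear system has to be solved.
small = np.linalg.solve(A @ A.T + lam * np.eye(s), A)
inv_woodbury = (np.eye(n) - A.T @ small) / lam
inv_direct = np.linalg.inv(K_tilde + lam * np.eye(n))

# Procedure (3): A A^T = R Lambda^2 R^T, keeping nonzero eigenvalues only.
evals, R = np.linalg.eigh(A @ A.T)
keep = evals > 1e-10 * evals.max()
lam2, R = evals[keep], R[:, keep]
Y = (A.T @ R) / np.sqrt(lam2)     # Y = A^T R Lambda^{-1}
pinv_eq3 = (Y / lam2) @ Y.T       # Y Lambda^{-2} Y^T
pinv_direct = np.linalg.pinv(K_tilde, rcond=1e-10)

print(np.allclose(inv_woodbury, inv_direct), np.allclose(pinv_eq3, pinv_direct))
```

Both routes only ever factor or solve $s \times s$ systems, which is where the savings over the $n \times n$ computations come from.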

For comparing the schemes for generating landmark points, the bounds for Gram matrix approximation have drawn tremendous attention, i.e.,

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_{\xi} \tag{4}$$

where $\xi = F$ indicates the Frobenius norm, $\xi = 2$ the spectral norm, and $\xi = *$ the trace norm. A comprehensive study of these bounds can be found in (Gittens and Mahoney, 2016). However, the bounds for generalization error should be the most sought-after when analyzing learning tasks.

### 2.2 Nyström Computational Regularization

To reach generalization bounds of the Nyström method, Jin et al. (2013) and Rudi et al. (2015) imposed Nyström computational regularization (NCR) upon kernel classification and KRR, respectively. Here, NCR is different from the standard Nyström. In the statistical learning setting, the former study deduces the corresponding generalization bounds for different column selection schemes, whereas the latter obtains the related generalization bound provided that landmark points are selected properly. In this work, we focus on KRR, while similar results for kernel classification can be obtained by using the techniques presented herein.

The optimization problem of KRR can be formulated as

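$$\operatorname*{argmin}_{f \in \mathcal{H}_n} \frac{1}{n}\sum_{i=1}^{n} \left(f(\mathbf{x}_{i\mathcal{H}}) - y_i\right)^2 + \lambda \|f\|_{\mathcal{H}}^2. \tag{5}$$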

Here, , , and .

The idea of NCR is to regularize the hypothesis into a carefully selected subspace . With replaced by in the problem (5), its optimal solution is

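$$\hat{f} = \sum_{i=1}^{m} \tilde{\alpha}_i \mathbf{c}_{i\mathcal{H}} \quad \text{with} \quad \tilde{\boldsymbol{\alpha}} = (\mathbf{K}_{mn}\mathbf{K}_{nm} + \lambda_0 \mathbf{W})^{\dagger} \mathbf{K}_{mn}\mathbf{y} \tag{6}$$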

where (Rudi et al., 2015). When applying NCR upon kernel classification, Jin et al. (2013) derived a separate analytical form of the optimal solution. Therefore, it will be cumbersome to apply NCR over other kernel SVMs if the corresponding expressions of optimal solutions need re-deducing individually.

### 2.3 Low-Rank Linearization Approach

Motivated by the efficiency of linear SVM solvers, the goal of the linearization approach is to find a map that transforms all data from into a low-dimensional space . Denote the mapped training samples by with . The linearization approach attempts to find a -dimensional feature map such that

$$\mathbf{K} \approx \tilde{\mathbf{K}} = \mathbf{M}^T\mathbf{M}. \tag{7}$$

In other words, both the linearization approach and the Nyström method seek to approximate the full Gram matrix well. Therefore, it is natural to integrate the basic concept of the Nyström method into linearization approach, which is the motivation of LLA (Lan et al., 2019). Considering Eq. (1), the matrix can be reformulated as , which is a compact SVD. If in Eq. (7) is set to be , we have , indicating that the map

 (8)

is a sought-after $s$-dimensional map in LLA. At first glance, it seems that LLA is entirely isolated from NCR. However, the Nyström subspace learning (NSL) framework will reveal that NCR ⊂ NSL + linear SVM learning = LLA.
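As an illustration of the linearization idea, the sketch below builds low-dimensional features from the top-$s$ eigenpairs of the landmark Gram matrix and checks that their inner products reproduce the Nyström approximation of Eq. (1). It assumes a Gaussian kernel, uniformly sampled landmarks, and the map $\boldsymbol{\Sigma}_{(s)}^{-1}\mathbf{V}_{(s)}^T\mathbf{K}_{mn}$ (the form that appears in the last line of Eq. (17)); the helper names `gaussian_kernel` and `lla_map`, the toy data, and all sizes are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    # Rows of A and B are samples; returns exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))                 # toy data, rows are samples
C = X[rng.choice(60, size=8, replace=False)]     # landmarks by uniform sampling
s = 5                                            # target rank, s <= m

W = gaussian_kernel(C, C)                        # m x m landmark Gram matrix
K_mn = gaussian_kernel(C, X)                     # m x n cross Gram matrix

evals, V = np.linalg.eigh(W)
top = np.argsort(evals)[::-1][:s]
V_s, sig2 = V[:, top], evals[top]                # W_(s) = V_s diag(sig2) V_s^T

def lla_map(K_cx):
    # Kernel evaluations against the landmarks -> s-dimensional features.
    return (V_s / np.sqrt(sig2)).T @ K_cx        # Sigma^{-1} V^T K_cx

M = lla_map(K_mn)                                # s x n mapped training data
K_tilde = K_mn.T @ ((V_s / sig2) @ V_s.T) @ K_mn # K_nm W_(s)^dagger K_mn

print(M.shape, np.allclose(M.T @ M, K_tilde))
```

The same `lla_map` can be applied to kernel evaluations of unseen samples, after which any linear solver operates on the mapped features.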

## 3 Proposed Framework

### 3.1 Reformulation of KPCA

Before getting into the main results, a reformulation of KPCA is presented as follows:

###### Proposition 1.

With a dataset , let be where . Denote the dimension of by . Let be a variable such that . For the optimization problem

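$$\operatorname*{argmin}_{\mathbf{t}_{\mathcal{H}}, \mathbf{B}_{\mathcal{H}}} \sum_{i=1}^{n} \left\| \mathbf{B}_{\mathcal{H}} \langle \mathbf{B}_{\mathcal{H}}, \mathbf{x}_{i\mathcal{H}} - \mathbf{t}_{\mathcal{H}} \rangle_{\mathcal{H}} - (\mathbf{x}_{i\mathcal{H}} - \mathbf{t}_{\mathcal{H}}) \right\|_{\mathcal{H}}^2 \quad \text{subject to} \quad \langle \mathbf{B}_{\mathcal{H}}, \mathbf{B}_{\mathcal{H}} \rangle_{\mathcal{H}} = \mathbf{I}, \tag{9}$$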

using the following procedures: 1) by a compact SVD where the diagonal elements of are in descending order; 2) ; and 3) let be the first vectors in , then the solution is optimal for the problem above.

###### Remark 1.

Here, serves as a basis for a hyperplane, whereas is a translation of the hyperplane. Therefore, the problem (9) aims to find a hyperplane with a translation that best fits the given data in the Hilbert space , and its solution is exactly what KPCA looks for. To the best of our knowledge, the existing studies on KPCA cope with distinct optimization problems (Schölkopf et al., 1997; Sterge et al., 2019). Specifically, the centralization step in KPCA, i.e., for all , is generally done empirically. Hence, we present a rigorous and detailed proof in the Supplementary File, and herein we use it to establish our main result.

### 3.2 Nyström Subspace Learning (NSL)

Instead of approximating the optimal rank- representation of the Gram matrix , we pay attention to the optimal -dimensional subspace. In other words, leaving out the translation variable , subspace learning aims to find the optimal solution that constitutes the basis of an optimal hyperplane that best fits the data .

As indicated by Proposition 1, the time complexity of generating an optimal -dimensional subspace is due to a compact SVD of . Therefore, following the Nyström method, we approximate subspace learning by using carefully selected landmark points. Specifically, we assume that where is the sketching matrix. With , the optimal -dimensional subspace generated from is employed as an approximate solution for the one obtained when using . Denote the approximate basis by . With any data , the filtered outcome will be . The whole procedure of NSL is outlined in Algorithm 1. Note that the formulation of Nyström subspace learning is connected to randomized SVD (Halko et al., 2011; Boutsidis and Gittens, 2013), which focuses on matrices instead. Specifically, Lemma 2 and Lemma 3 presented in the Supplementary File offer an analytical tool to employ the bounds developed for randomized SVD upon Nyström subspace learning.

Considering the Gram matrix of the filtered training data , note that , then

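$$\begin{aligned}
\tilde{\mathbf{K}} &= \left\langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}},\ \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \right\rangle_{\mathcal{H}} \\
&= \mathbf{K}_{nm}\mathbf{V}_{(s)}\boldsymbol{\Sigma}_{(s)}^{-2}\mathbf{V}_{(s)}^T\mathbf{K}_{mn} = \mathbf{K}_{nm}\mathbf{W}_{(s)}^{\dagger}\mathbf{K}_{mn}.
\end{aligned} \tag{10}$$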

By comparing Eq. (10) and Eq. (1), one can observe that NSL can be treated as an expansion of the Nyström method. The advantage of this standpoint will be clear in the following. To estimate the approximation, an empirical bound taken into consideration is

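$$\left\| \mathbf{X}_{\mathcal{H}} - \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \right\|_{\xi}. \tag{11}$$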

Here, that refers to operator norm. Indeed, the bounds above are connected to the Gram matrix approximation bounds (4), as the following Proposition tells.

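###### Proposition 2.

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_{*} = \left\| \mathbf{X}_{\mathcal{H}} - \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \right\|_{\mathcal{H}}, \tag{12}$$

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_{2} = \left\| \mathbf{X}_{\mathcal{H}} - \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \right\|_{op}^{2}. \tag{13}$$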

The proof is presented in the Supplementary File. Proposition 2 shows that NSL and the Nyström method are closely related. Therefore, most theoretical bounds of the Nyström method regarding the Gram matrix approximation can be directly used for NSL, and vice versa.
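The link between subspace filtering and Gram matrix approximation can be illustrated numerically. The sketch below is a toy check under a linear kernel, an assumption made so that the feature map is the identity and the Hilbert-space quantities can be formed explicitly. It builds an orthonormal basis of the landmark span, filters the data, and compares the result with Eq. (10) and Proposition 2. The Hilbert-space norm in Eq. (12) is read here as the squared Frobenius norm of the residual, which is an assumption about the notation; all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 6, 50, 4
X = rng.standard_normal((d, n))                   # columns are samples; linear kernel
C = X[:, rng.choice(n, size=m, replace=False)]    # landmark columns

K = X.T @ X                                       # full Gram matrix
W = C.T @ C                                       # landmark Gram matrix
K_nm = X.T @ C

evals, V = np.linalg.eigh(W)
keep = evals > 1e-10 * evals.max()                # s = dimension of span(C)
V_s, sig2 = V[:, keep], evals[keep]

B = (C @ V_s) / np.sqrt(sig2)                     # orthonormal basis: B^T B = I
X_filtered = B @ (B.T @ X)                        # data filtered through the subspace
K_tilde = K_nm @ (V_s / sig2) @ V_s.T @ K_nm.T    # K_nm W_(s)^dagger K_mn

resid = X - X_filtered
trace_norm = np.linalg.norm(K - K_tilde, ord='nuc')
spectral_norm = np.linalg.norm(K - K_tilde, ord=2)

print(np.allclose(X_filtered.T @ X_filtered, K_tilde))
print(trace_norm, np.linalg.norm(resid, 'fro') ** 2)
print(spectral_norm, np.linalg.norm(resid, 2) ** 2)
```

With a nonlinear kernel the basis cannot be formed explicitly, but the same identities hold through kernel evaluations.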

### 3.3 LLA = NSL + Linear SVM Learning

The following is the general formulation of kernel SVMs:

 (14)

where is a loss function, and is a non-decreasing regularizing function. The regularization results from the representer theorem. To be self-contained, a related proof is provided in the Supplementary File.

NSL suggests that any data in the Hilbert space , including training samples and any unseen sample , can be filtered as and , respectively. But as will be shown later, the filtering upon is done automatically by the generated optimal solution to the considered problem below. With filtered training samples , the optimization problem (14) becomes

 (15)

which is further equivalent to

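$$\operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^{s}} \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{w}^T \tilde{\mathbf{l}}_i, y_i) + \Omega(\|\mathbf{w}\|_F^2) \tag{16}$$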

where and . Equivalence (a bijection between and ) holds due to . A detailed reasoning is given in the Supplementary File. Denote the optimal solution of the problem (16) by . Then, when applying the optimal solution to the problem (15) over , it is

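$$\begin{aligned}
\hat{\mathbf{w}}^T \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \tilde{\mathbf{x}}^{*}_{\mathcal{H}} \rangle_{\mathcal{H}} &= \hat{\mathbf{w}}^T \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{x}^{*}_{\mathcal{H}} \rangle_{\mathcal{H}} \rangle_{\mathcal{H}} \\
&= \hat{\mathbf{w}}^T \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{x}^{*}_{\mathcal{H}} \rangle_{\mathcal{H}} \\
&= \hat{\mathbf{w}}^T \boldsymbol{\Sigma}_{(s)}^{-1}\mathbf{V}_{(s)}^T \langle \mathbf{C}_{\mathcal{H}}, \mathbf{x}^{*}_{\mathcal{H}} \rangle_{\mathcal{H}}.
\end{aligned} \tag{17}$$

###### Remark 2.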

It is worth mentioning that Eq. (17) carries two important pieces of information. First, the second equality shows that the optimal solution derived from the problem (15) implicitly applies the filtering to unseen samples. In other words, the filtering over unseen data is automatic. Second, the last equality suggests that optimizing the problem (15) and then applying the obtained optimal solution is equivalent to: 1) mapping all data in $\mathcal{H}$ into $\mathbb{R}^{s}$ by using $\boldsymbol{\Sigma}_{(s)}^{-1}\mathbf{V}_{(s)}^T \langle \mathbf{C}_{\mathcal{H}}, \cdot \rangle_{\mathcal{H}}$, and then 2) performing linear SVM learning and the corresponding application. To sum up, the whole procedure related to the learning problem (15) is listed in Algorithm 2.

Considering that in Subsection 2.3 and the procedure in Algorithm 2, note that , it makes sense to expect that and . Comparing Eq. (17), in the problem (16), and the -dimensional map (8) sought by LLA, it is obvious that the procedure in Algorithm 2 is exactly LLA, except that we develop it based on NSL. Note that the development of LLA proposed by Lan et al. (2019) cannot relate itself to Nyström computational regularization, or further reveals its relation with the standard Nyström.

#### Computation complexity

The complexity of the Algorithm 2 is as efficient as the Nyström method. Generally, the most computationally expensive step in the training stage is to generate , which is at most . Note that matrix multiplication could be speeded up by splitting into blocks and then computing in parallel. Specifically, if uniform sampling is employed, the corresponding complexity of LLA becomes .

### 3.4 Comparison between LLA and the standard Nyström

###### Remark 3.

The previous works somewhat mix the standard Nyström and NCR ( LLA), though they are different. For example, in study (Cortes et al., 2010), it is the standard Nyström when analyzing KRR, but becomes an equivalent form of NCR when turning to the kernel SVM with hinge loss. But they do not make the difference clear.

With , the problem (15) can be solved via

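$$\operatorname*{argmin}_{\boldsymbol{\alpha} \in \mathbb{R}^{n}} \frac{1}{n}\sum_{i=1}^{n} L(\boldsymbol{\alpha}^T \tilde{\mathbf{k}}_i, y_i) + \Omega\left(\left\| \boldsymbol{\alpha}^T \tilde{\mathbf{K}} \boldsymbol{\alpha} \right\|_F^2\right), \tag{18}$$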

which is the approximate kernel SVM problem when implementing the standard Nyström. If is an optimal solution to the problem (18), is an approximate optimal solution by employing the standard Nyström. Unlike Equivalence between the problems (15) and (16), the transformation from the hypothesis of the problem (18) into that of the problem (15) is not necessarily a bijection, but at least a surjection.

Notably, the analysis above indicates that the problem (18) is also connected to the problem (16), even though their hypotheses are distinct. In other words, there must be an optimal solution to the problem (18) such that , and vice versa. Since the dimension of the hypothesis in the problem (18) is larger than in the problem (16), it could be that searching an optimal solution over is easier to be tackled, which is a potential edge of solving the problem (16) instead of the other. In fact, the optimal solution generated by the standard Nyström can also be obtained by using , which will be mentioned in Subsection 3.6.

If we focus on solving the problem (18), the error between the solutions obtained by using the standard Nyström and (NCR ) LLA is

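$$\left\| \mathbf{X}_{\mathcal{H}}\hat{\boldsymbol{\alpha}} - \tilde{\mathbf{X}}_{\mathcal{H}}\hat{\boldsymbol{\alpha}} \right\|_{\mathcal{H}} \leq \left\| \mathbf{X}_{\mathcal{H}} - \tilde{\mathbf{X}}_{\mathcal{H}} \right\|_{op} \|\hat{\boldsymbol{\alpha}}\|_F. \tag{19}$$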

Here, the inequality is immediate according to the definition. Moreover, combining the above inequality with Proposition 2, there is

 (20)

Therefore, if the error of NSL or equivalently Gram matrix approximation is sufficiently small, the solutions between LLA and the standard Nyström are comparable.

### 3.5 NCR ⊂ NSL + Linear SVM Learning

To analyze what is going on when imposing NCR, it needs to assume that is the dimension of , which makes NCR a special case. So, and . Since , NCR imposed upon kernel SVM can be expressed as

 (21)

Let be , since

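$$\begin{aligned}
\langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} &= \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \rangle_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \\
&= \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \tilde{\mathbf{B}}^{*}_{\mathcal{H}} \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \mathbf{X}_{\mathcal{H}} \rangle_{\mathcal{H}} \rangle_{\mathcal{H}} \\
&= \langle \tilde{\mathbf{B}}^{*}_{\mathcal{H}}, \tilde{\mathbf{X}}_{\mathcal{H}} \rangle_{\mathcal{H}},
\end{aligned} \tag{22}$$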

the problem (21) can be equivalently transformed into the problem (16), which is further equivalent to the problem (15). In other words, when is assumed to be the dimension of , imposing NCR is exactly LLA. If is relaxed, NCR ⊂ LLA = NSL + linear SVM learning.

###### Remark 4.

Equivalence between NCR and LLA unravels that NCR implicitly implements a Nyström subspace learning for all data as the first step, although the regularization is meant to speed up the resolution of kernel SVMs at the training stage.

#### LLA and NCR over KRR are the same

Obviously, KRR defined by the problem (5) with Nyström computational regularization can be solved in the forms of (16), (18) and (21), which leads to three different analytical expressions of optimal solutions. Note that the equivalences presented previously do not mean that the three solutions must be equal. At least, if the optimal solution to the problem (16) is unique, they must be the same. Here, we aim to explicitly show that the three analytical solutions for KRR are indeed equal.

###### Proposition 3.

Regarding KRR, let be the dimension of , then in form of (18), the solution is

$$\tilde{\mathbf{X}}_{\mathcal{H}} (\tilde{\mathbf{K}} + \lambda_0 \mathbf{I})^{-1} \mathbf{y}, \tag{23}$$

whereas, it becomes

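$$\mathbf{C}_{\mathcal{H}} \mathbf{H} (\mathbf{H}^T \mathbf{K}_{mn}\mathbf{K}_{nm} \mathbf{H} + \lambda_0 \mathbf{I})^{-1} \mathbf{H}^T \mathbf{K}_{mn} \mathbf{y} \tag{24}$$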

in form of (16) where , and it is

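$$\mathbf{C}_{\mathcal{H}} (\mathbf{K}_{mn}\mathbf{K}_{nm} + \lambda_0 \mathbf{W})^{\dagger} \mathbf{K}_{mn} \mathbf{y} \tag{25}$$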

in the remaining form (21). In fact, they are all the same.

The proof is provided in the Supplementary File.

Since the generalization bounds for KRR with NCR have been well studied (Yang et al., 2012; Rudi et al., 2015), Proposition 3 tells that LLA upon KRR shares the same generalization bounds as NCR upon KRR.
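The agreement of the three KRR solutions can be checked on toy data. The sketch below assumes a Gaussian kernel, uniformly sampled landmarks, and $\mathbf{H} = \mathbf{V}_{(s)}\boldsymbol{\Sigma}_{(s)}^{-1}$; this choice of $\mathbf{H}$, the helper `gaussian_kernel`, the data, and the value `lam0` are assumptions for illustration. Each solution is expressed through its coefficients over the landmarks, so that predictions on unseen points can be compared directly.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Rows of A and B are samples; returns exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
n, m, lam0 = 80, 6, 0.5                          # hypothetical sizes and regularization
X = rng.standard_normal((n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
C = X[rng.choice(n, size=m, replace=False)]      # landmarks by uniform sampling
X_test = rng.standard_normal((10, 2))

W = gaussian_kernel(C, C)
K_mn = gaussian_kernel(C, X)
K_test = gaussian_kernel(X_test, C)              # kernel between test points and landmarks

evals, V = np.linalg.eigh(W)
keep = evals > 1e-10 * evals.max()
H = V[:, keep] / np.sqrt(evals[keep])            # assumed H = V Sigma^{-1}
M = H.T @ K_mn                                   # mapped training data, s x n
s = M.shape[0]
K_tilde = M.T @ M

# Landmark coefficients for the forms (23), (24) and (25), respectively.
beta_23 = H @ M @ np.linalg.solve(K_tilde + lam0 * np.eye(n), y)
beta_24 = H @ np.linalg.solve(M @ M.T + lam0 * np.eye(s), M @ y)
beta_25 = np.linalg.pinv(K_mn @ K_mn.T + lam0 * W, rcond=1e-12) @ K_mn @ y

pred_23, pred_24, pred_25 = (K_test @ b for b in (beta_23, beta_24, beta_25))
print(np.allclose(pred_23, pred_24), np.allclose(pred_24, pred_25))
```

Only the form (24) avoids both the $n \times n$ solve and the pseudo-inverse, which is the computational appeal of working in the mapped space.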

### 3.6 A Simpler Way for the Standard Nyström

###### Proposition 4.

Denote an optimal solution to the problem (16) by , then

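$$\mathbf{P}^{\dagger}\hat{\mathbf{w}} \quad \text{with} \quad \mathbf{P} = \boldsymbol{\Sigma}_{(s)}^{-1}\mathbf{V}_{(s)}^T \mathbf{K}_{mn} \tag{26}$$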

is an optimal solution to the problem (18).

Please refer to the Supplementary File for full deduction. Suppose is generated efficiently, which depends on the chosen selection strategy. The computation complexity for performing is . Therefore, it does not load any burden on the standard Nyström, but makes it simpler to implement.
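A small numerical illustration of Proposition 4 for the squared loss follows. It is a sketch under stated assumptions: a Gaussian kernel, uniformly sampled landmarks, an un-normalized sum of squares with a hypothetical regularization value `lam0`, and a hand-written ridge solve standing in for an off-the-shelf linear solver. The lifted vector $\mathbf{P}^{\dagger}\hat{\mathbf{w}}$ is compared with the closed-form minimizer of the corresponding instance of the problem (18).

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Rows of A and B are samples; returns exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(3)
n, m, lam0 = 70, 6, 0.5                          # hypothetical sizes and regularization
X = rng.standard_normal((n, 2))
y = np.cos(X[:, 1]) + 0.1 * rng.standard_normal(n)
C = X[rng.choice(n, size=m, replace=False)]      # landmarks by uniform sampling

evals, V = np.linalg.eigh(gaussian_kernel(C, C))
keep = evals > 1e-10 * evals.max()
P = (V[:, keep] / np.sqrt(evals[keep])).T @ gaussian_kernel(C, X)  # Sigma^{-1} V^T K_mn
s = P.shape[0]
K_tilde = P.T @ P

# Linear ridge problem in R^s (squared-loss instance of the problem (16)).
w_hat = np.linalg.solve(P @ P.T + lam0 * np.eye(s), P @ y)
# Proposition 4: lift the linear solution to a solution of the problem (18).
alpha_lifted = np.linalg.pinv(P, rcond=1e-12) @ w_hat
# Closed-form minimizer of the squared-loss instance of (18), for comparison.
alpha_direct = np.linalg.solve(K_tilde + lam0 * np.eye(n), y)

def objective_18(alpha):
    residual = K_tilde @ alpha - y
    return residual @ residual + lam0 * alpha @ K_tilde @ alpha

print(objective_18(alpha_lifted), objective_18(alpha_direct))
```

The two coefficient vectors need not coincide entry by entry, since the objective depends on the coefficients only through $\mathbf{P}\boldsymbol{\alpha}$; what matches is the objective value and the fitted outputs $\tilde{\mathbf{K}}\boldsymbol{\alpha}$.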

###### Remark 5.

Traditionally, to apply the standard Nyström, one needs to plug the Woodbury formula (2) or the procedure (3) into the corresponding optimization procedure. By contrast, Proposition 4 suggests a simpler way to implement the standard Nyström, as one can employ off-the-shelf linear SVM solvers directly with some additional transformations.

### 3.7 NSL for the Clustered Nyström Method

Zhang and Kwok (2010) observed that if the set of the selected landmark points contains two training samples, say and , probably , then . Motivated by such an observation, the clustered Nyström method was accordingly proposed, and has proved practical in applications. This observation is immediate from the point of NSL, and a general statement is provided as follows:

###### Proposition 5.

For any point , define by , and by . Then, for any point and , the related reconstruction error satisfies that . Therefore, the reconstruction error is if or belongs to .

Proof. Since , it is obvious that . The inequality holds due to Cauchy-Schwarz inequality. Clearly, if and only if .
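The observation behind Proposition 5 can be seen numerically. The sketch below assumes a Gaussian kernel and landmarks drawn from the training samples, with the rank taken as the full rank of the landmark Gram matrix; the data and sizes are hypothetical. It checks that entries of the approximate Gram matrix are exact whenever the row or column belongs to a landmark, and that every entry-wise error is bounded by the product of the two residual norms, as the Cauchy-Schwarz argument suggests.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Rows of A and B are samples; returns exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(4)
n, m = 40, 5
X = rng.standard_normal((n, 2))
idx = rng.choice(n, size=m, replace=False)       # landmark indices among training samples

K = gaussian_kernel(X, X)
K_nm = K[:, idx]
W = K[np.ix_(idx, idx)]
K_tilde = K_nm @ np.linalg.pinv(W, rcond=1e-12) @ K_nm.T

err = np.abs(K - K_tilde)
# Diagonal of K - K_tilde holds the squared residual norms of the filtered samples.
resid_sq = np.maximum(np.diag(K - K_tilde), 0.0)
cs_bound = np.sqrt(np.outer(resid_sq, resid_sq))

max_err_landmarks = err[idx, :].max()            # rows that belong to landmarks
max_err_overall = err.max()
bound_holds = bool(np.all(err <= cs_bound + 1e-9))
print(max_err_landmarks, max_err_overall, bound_holds)
```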

#### A potential drawback of the clustered Nyström method

As suggested by Proposition 5, if the goal is to minimize the reconstruction error , is expected to capture as many training samples as possible. However, due to the non-linearity of , the clustered Nyström method could produce a solution that involves a trifling part. To be precise, suppose the clusters are expressed by , where is a cluster indicator matrix that tells how to form the clusters from the given samples. There is no guarantee that . So, it could be that such that and (orthogonal complement). In this case,

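$$\tilde{\mathbf{K}} = \langle \tilde{\mathbf{X}}_{\mathcal{H}}, \tilde{\mathbf{X}}_{\mathcal{H}} \rangle_{\mathcal{H}} = \langle \hat{\mathbf{X}}_{\mathcal{H}}, \hat{\mathbf{X}}_{\mathcal{H}} \rangle_{\mathcal{H}}. \tag{27}$$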

Here and . So, does not help provide better approximation. It suggests that it would be better to do clustering over directly, which we call the kernel clustered Nyström method. In this case, the sketching matrix becomes the cluster indicator matrix.

#### A sharp bound for the (kernel) clustered Nyström method

Suppose () contains clusters in , let be the corresponding mutually-disjoint clustering partition over such that for each and each , where . Define the related kernel clustering error by

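$$E(\mathbf{C}_{\mathcal{H}}) = \sum_{i=1}^{k} \sum_{\mathbf{x}_{\mathcal{H}} \in S_i} \left\| \mathbf{x}_{\mathcal{H}} - \mathbf{c}_{i\mathcal{H}} \right\|_{\mathcal{H}}^2, \tag{28}$$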

then we have the following sharp bound for the (kernel) clustered Nyström method.

###### Proposition 6.

Let be the dimension of , then

$$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F \leq E(\mathbf{C}_{\mathcal{H}}). \tag{29}$$

The proof is presented in the Supplementary File.
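The bound (29) can be probed on toy data. The sketch below assumes a linear kernel, so that kernel $k$-means coincides with ordinary $k$-means and the clustering error (28) can be computed in input space; the few Lloyd iterations, the initialization from the first samples, and all sizes are hypothetical choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 5, 60, 4
X = rng.standard_normal((d, n))                  # columns are samples; linear kernel

# A few Lloyd iterations (plain k-means, which coincides with kernel k-means here).
centers = X[:, :k].copy()
for _ in range(20):
    dist2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)  # n x k
    labels = dist2.argmin(axis=1)
    for j in range(k):
        if np.any(labels == j):
            centers[:, j] = X[:, labels == j].mean(axis=1)

dist2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
labels = dist2.argmin(axis=1)
clustering_error = dist2[np.arange(n), labels].sum()   # E(C_H) as in Eq. (28)

# Nystrom approximation with the cluster centers as landmarks.
K = X.T @ X
K_nm = X.T @ centers
W = centers.T @ centers
K_tilde = K_nm @ np.linalg.pinv(W, rcond=1e-12) @ K_nm.T
frob_error = np.linalg.norm(K - K_tilde, ord='fro')

print(frob_error, clustering_error)
```

In this setting the bound follows from projecting each sample onto the span of the centers, which can only be closer than its assigned center.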

###### Remark 6.

Compared with the initial work (Zhang and Kwok, 2010), the bound provided by Proposition 6 is sharper. In particular, unlike the initial work, we do not make any assumption about the selected kernel function. In fact, some similar theoretical results on the kernel clustered Nyström method have already been presented by Oglic and Gärtner (2017). However, our perspective is different from theirs. To be precise, their development leading to the interpretation of is based on the best extrapolation for by using landmark points , which is distinct from ours. Here, by combining Proposition 3 and Theorem 4 presented in their work, Proposition 6 herein further leads to the following Corollary, which is similar to their main theorem.

###### Corollary 1.

Suppose is the dimension of . If the clusters are obtained by using the kernel $k$-means++ algorithm (Arthur and Vassilvitskii, 2006), then it holds that

$$\mathbb{E}\left[ \frac{\|\mathbf{K} - \tilde{\mathbf{K}}\|_F}{\|\mathbf{K} - \mathbf{K}_{(k)}\|_F} \right] \leq 8\left(\ln(k+1) + 2\right)\left(\sqrt{n-k} + \Theta_k\right). \tag{30}$$

Here, $\mathbf{K}_{(k)}$ is the best rank-$k$ approximation of $\mathbf{K}$ with respect to unitarily invariant norm, whereas $\Theta_k$ is adopted from their work. Besides, Proposition 6 can also be used to deduce the theoretical trade-offs when using the approximate kernel $k$-means algorithm (Wang et al., 2019) by incorporating the theorems developed in their work, but we do not present them herein. In a nutshell, even though kernel $k$-means clustering is computationally expensive for large-scale data, it can be approximately performed in an efficient way while preserving good Gram matrix approximation.

## 4 Experiment

The performance of NCR pertaining to KRR has been demonstrated by Jin et al. (2013) and Rudi et al. (2015). However, the empirical comparison between NCR and the standard Nyström is somewhat insufficient, possibly because it is inconvenient to apply the standard Nyström to a variety of kernel SVMs without Proposition 4 proposed herein. Therefore, to provide a complementary study, we compare the standard Nyström and NCR in both regression and classification tasks. Unlike previous studies, Proposition 4 allows us to implement the standard Nyström upon any kernel SVM by employing efficient off-the-shelf linear SVM solvers as a black box.

### 4.1 Experiment Setup

Experiments herein are performed on a computer with 8 × 2.40 GHz Intel(R) Core(TM) i7-4700HQ CPU with 16 GB of RAM. For selection strategies, the clustered Nyström method (denoted by CN) and its kernel version (denoted by KCN) are taken into consideration. To perform the kernel clustered Nyström method without any approximation, we focus on datasets with less than samples. Four datasets (see footnote 1) are employed: a) abalone (; ) and b) space_ga (; ) for regression tasks, while c) satimage (; ; #class ) and d) dna (; ; #class ) for classification tasks. The latter two have already been divided into a training set, a validation set and a testing set. For the former two datasets, we randomly split the whole dataset into a training part (), a validation part () and a testing part (). KRR is considered for regression tasks, whereas $\nu$-SVM is utilized for classification tasks. Specifically, KMeans and NuSVC from sklearn are employed for (kernel) $k$-means clustering and $\nu$-SVM, respectively. The maximum iterations for KMeans and NuSVC are fixed as and , respectively. Following previous studies, the Gaussian kernel is selected. Since the considered selection strategies involve randomness, the averaged result with its standard deviation over the first random seeds is reported. The subset size , i.e., clusters, is gradually increased from to of the size of the training data for each dataset. Here, is set to be the dimension of , or equivalently .

We tune the hyperparameters based on the training and validation sets. The considered ranges for and are and , respectively. The ones selected are: a) abalone with and , b) space_ga with and , c) satimage with and , and d) dna with and .

### 4.2 Results

The corresponding results are shown in Figure 1. An interesting conclusion is that it is hard to tell which implementation, the standard Nyström or NCR, is better. On the regression task space_ga, NCR significantly outperforms the standard Nyström. But for the classification task dna, NCR becomes inferior to the standard Nyström. Moreover, the empirical results attest to our conclusion from Eq. (20) that the difference between the standard Nyström and NCR tends to be smaller when the corresponding Gram matrix approximation improves. More experimental results can be found in the Supplementary File.

## 5 Conclusion

In this study, we propose a Nyström subspace learning (NSL) framework to ease the application of the Nyström method to large-scale kernel SVMs. Based on our analysis, the bounds developed for the Nyström method are closely connected to NSL. The main idea of the proposed NSL is that it closely relates the Nyström method to KPCA, which has been shown to uncover the relationships among NCR, the standard Nyström and LLA. The conclusions include: 1) NCR ⊂ NSL + linear SVM learning = LLA, which tells that although NCR is designed for regularizing the training phase, it implicitly performs NSL for all data as the first step. 2) Both NCR and the standard Nyström upon kernel SVMs can be efficiently implemented by using off-the-shelf linear SVM solvers as a black box. 3) When the Gram matrix approximation error is sufficiently small, the difference between NCR and the standard Nyström would be negligible, which is supported by the empirical results. Besides, we also demonstrate how NSL can be used to develop sharper theoretical results for the clustered Nyström method by providing a sharper bound for the corresponding Gram matrix approximation. As shown by our empirical study, depending on the learning task, NCR could perform significantly better or even worse than the standard Nyström. Therefore, it is interesting to further explore the differences between NCR and the standard Nyström.

### Footnotes

1. LIBSVM archive: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

### References

1. K-means++: the advantages of careful seeding. Technical report Stanford. Cited by: Corollary 1.
2. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications 34 (3), pp. 1301–1340. Cited by: §3.2.
3. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research 11 (Apr), pp. 1471–1490. Cited by: §1.
4. On the impact of kernel approximation on learning accuracy. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 113–120. Cited by: §1, Remark 3.
5. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research 13 (Dec), pp. 3475–3506. Cited by: §2.1.
6. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6 (Dec), pp. 2153–2175. Cited by: §1, §2.1.
7. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research 17 (1), pp. 3977–4041. Cited by: §1, §2.1, §2.1.
8. Linearized kernel dictionary learning. IEEE Journal of Selected Topics in Signal Processing 10 (4), pp. 726–739. Cited by: §1.
9. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53 (2), pp. 217–288. Cited by: §3.2.
10. Improved bounds for the Nyström method with application to kernel classification. IEEE Transactions on Information Theory 59 (10), pp. 6939–6949. Cited by: §1, §2.2, §2.2, §4.
11. Ensemble Nyström method. In Advances in Neural Information Processing Systems, pp. 1060–1068. Cited by: §2.1.
12. Sampling methods for the Nyström method. Journal of Machine Learning Research 13 (Apr), pp. 981–1006. Cited by: §1, §1.
13. Scaling up kernel SVM on limited resources: a low-rank linearization approach. IEEE Transactions on Neural Networks and Learning Systems 30 (2), pp. 369–378. Cited by: §1, §2.3, §3.3.
14. Nyström method with kernel k-means++ samples as landmarks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2652–2660. Cited by: §1, §2.1, Remark 6.
15. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184. Cited by: §1, §1.
16. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pp. 1657–1665. Cited by: §1, §2.2, §2.2, §3.5, §4.
17. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Cited by: Remark 1.
18. Computationally efficient Nyström approximation using fast transforms. In International Conference on Machine Learning, pp. 2655–2663. Cited by: §2.1.
19. Gain with no pain: efficient kernel-PCA by Nyström sampling. arXiv preprint arXiv:1907.05226. Cited by: Remark 1.
20. A review of Nyström methods for large-scale machine learning. Information Fusion 26, pp. 36–48. Cited by: §1, §1.
21. Scalable kernel k-means clustering with Nyström approximation: relative-error bounds. Journal of Machine Learning Research 20 (1), pp. 431–479. Cited by: Remark 6.
22. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research 14 (1), pp. 2729–2769. Cited by: §2.1.
23. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pp. 682–688. Cited by: §1, §1, §2.1.
24. Nyström method vs random Fourier features: a theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pp. 476–484. Cited by: §1, §3.5.
25. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks 21 (10), pp. 1576–1587. Cited by: §1, §2.1, §3.7, Remark 6.