Distributed Adaptive Sampling for Kernel Matrix Approximation

Daniele Calandriello, Alessandro Lazaric, Michal Valko
SequeL team, INRIA Lille - Nord Europe

Abstract

Most kernel-based methods, such as kernel or Gaussian process regression, kernel PCA, ICA, or $k$-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix $K_n$ requires at least $\mathcal{O}(n^2)$ time and space for $n$ samples. Recent works [1, 9] show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for $K_n$. The drawback of RLS-based methods is that computing exact RLS requires constructing and storing the whole kernel matrix. In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that sequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{\mathrm{eff}}(\gamma)$ of the dataset. Moreover, since all the RLS estimations are efficiently performed using only the small dictionary, SQUEAK is the first RLS sampling algorithm that never constructs the whole matrix $K_n$, runs in linear time w.r.t. $n$, and requires only a single pass over the dataset. We also propose a parallel and distributed version of SQUEAK that linearly scales across multiple machines, achieving similar accuracy in as little as $\mathcal{O}(\log(n))$ time.

1 Introduction

One of the major limits of kernel ridge regression (KRR), kernel PCA [15], and other kernel methods is that for $n$ samples storing and manipulating the kernel matrix $K_n$ requires $\mathcal{O}(n^2)$ space, which becomes rapidly infeasible for even a relatively small $n$. For larger sizes (or streams) we cannot even afford to store or process the data on a single machine. Many solutions focus on how to scale kernel methods by reducing their space (and time) complexity without compromising the prediction accuracy. A popular approach is to construct low-rank approximations of the kernel matrix by randomly selecting a subset (dictionary) of $m$ columns from $K_n$, thus reducing the space complexity to $\mathcal{O}(nm)$. These methods, often referred to as Nyström approximations, mostly differ in the distribution used to sample the columns of $K_n$ and in the construction of the low-rank approximation. Both of these choices significantly affect the accuracy of the resulting approximation [14]. Bach [2] showed that uniform sampling preserves the prediction accuracy of KRR (up to $\varepsilon$) only when the number of columns $m$ is proportional to the maximum degree of freedom of the kernel matrix. This may require sampling as many as $n$ columns in datasets with high coherence [6], i.e., a kernel matrix with weakly correlated columns. On the other hand, Alaoui and Mahoney [1] showed that sampling columns according to their ridge leverage scores (RLS) (i.e., a measure of the influence of a point on the regression) produces an accurate Nyström approximation with a number of columns proportional to the average degrees of freedom of the matrix, called the effective dimension. Unfortunately, computing exact RLS requires storing the whole kernel matrix, thus making this approach infeasible. However, Alaoui and Mahoney [1] proposed a fast method to compute a constant-factor approximation of the RLS and showed that accuracy and space complexity are close to the case of sampling with exact RLS, at the cost of an extra dependency on the inverse of the minimal eigenvalue of the kernel matrix. Unfortunately, the minimal eigenvalue can be arbitrarily small in many problems. Calandriello et al. [3] addressed this issue by processing the dataset incrementally and updating estimates of the ridge leverage scores, effective dimension, and Nyström approximations on-the-fly. Although the space complexity of the resulting algorithm (INK-Estimate) does not depend on the minimal eigenvalue anymore, it introduces a dependency on the largest eigenvalue of $K_n$, which in the worst case can be as large as $n$, thus losing the advantage of the method. In this paper we introduce an algorithm for SeQUEntial Approximation of Kernel matrices (SQUEAK), a new algorithm that builds on INK-Estimate, but uses unnormalized RLS. This improvement, together with a new analysis, opens the way to major improvements over current leverage-score sampling methods (see Sect. 6 for a comparison with existing methods), closely matching the dictionary size achieved by exact RLS sampling. First, unlike INK-Estimate, SQUEAK is simpler, does not need to compute an estimate of the effective dimension for normalization, and exploits a simpler, more accurate RLS estimator. This new estimator only requires access to the points stored in the dictionary. Since the size of the dictionary is much smaller than $n$, SQUEAK needs to actually observe only a small fraction of the kernel matrix $K_n$, resulting in a runtime linear in $n$.
Second, since our dictionary updates require only access to local data, our algorithm allows for distributed processing where machines operating on different dictionaries do not need to communicate with each other. In particular, intermediate dictionaries can be extracted in parallel from small portions of the dataset and later merged in a hierarchical way. Third, the sequential nature of SQUEAK requires a more sophisticated analysis that takes into consideration the complex interactions and dependencies between successive resampling steps. The analysis of SQUEAK builds on a new martingale argument that could be of independent interest for similar online resampling schemes. Moreover, SQUEAK can naturally incorporate new data without recomputing the whole resparsification from scratch and therefore it can be applied in streaming settings. We note that there exist other ways to avoid the intricate dependencies with a simpler analysis, for example by resampling [9], but with negative algorithmic side effects: these methods need to pass through the dataset multiple times. SQUEAK passes through the dataset only once¹ and is therefore the first provably accurate kernel approximation algorithm that can handle both streaming and distributed settings.

¹Note that there is an important difference between whether the method passes through the kernel matrix only once or through the dataset only once: in the former case, the algorithm may still need to access a single data point multiple times, thus making it unsuitable for the streaming setting and less practical for distributed computation.

2 Background

In this section, we introduce the notation and the basic concepts of kernel approximation used throughout the paper.

Notation. We use curly capital letters for collections, upper-case bold letters for matrices and operators, lower-case bold letters for vectors, and lower-case letters for scalars, with the exception of the letters reserved for the kernel function and the feature map, which denote functions. We write $[n]$ for the set of integers between $1$ and $n$. We denote by $[A]_{ij}$ and $[a]_i$ the $(i,j)$-th element of a matrix $A$ and the $i$-th element of a vector $a$, respectively. We denote by $I_n$ the identity matrix of dimension $n$ and by $\mathrm{Diag}(a)$ the diagonal matrix with the vector $a$ on the diagonal. We use $e_{n,i} \in \mathbb{R}^n$ to denote the indicator vector of dimension $n$ for element $i$. When the dimension of $I$ and $e_i$ is clear from the context, we omit the $n$. We write $A \preceq B$ to indicate that $B - A$ is a positive semi-definite (PSD) operator.

Kernel. We consider a positive definite kernel function $\mathcal{K} : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ and we denote by $\mathcal{H}$ its induced Reproducing Kernel Hilbert Space (RKHS), and by $\varphi : \mathcal{X} \rightarrow \mathcal{H}$ its corresponding feature map. Using $\varphi$, and without loss of generality, for the rest of the paper we replace $\mathcal{H}$ with a high-dimensional space $\mathbb{R}^D$ where $D$ is large and potentially infinite. With this notation, the kernel evaluated between two points $x_i$ and $x_j$ can be expressed as $\mathcal{K}(x_i, x_j) = \varphi(x_i)^{\top}\varphi(x_j)$. Given a dataset of points $\mathcal{D}_n = \{x_t\}_{t=1}^{n}$, we define the (empirical) kernel matrix $K_n \in \mathbb{R}^{n \times n}$ as the application of the kernel function on all pairs of input values (i.e., $[K_n]_{ij} = \mathcal{K}(x_i, x_j)$ for any $i, j \in [n]$), with $k_{n,i}$ as its $i$-th column. We also define the feature vectors $\phi_t = \varphi(x_t) \in \mathbb{R}^D$ and, after introducing the matrix $\Phi_n = [\phi_1, \dots, \phi_n] \in \mathbb{R}^{D \times n}$, we can rewrite the kernel matrix as $K_n = \Phi_n^{\top}\Phi_n$.
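To make the notation concrete, the following sketch (ours, not part of the paper) instantiates it for a Gaussian kernel; the kernel choice, bandwidth, dataset size, and all variable names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) evaluated for all pairs."""
    sq_dists = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # dataset D_n with n = 100 points in R^3
K = rbf_kernel(X, X)                # empirical kernel matrix K_n (n x n)
k_i = K[:, 0]                       # i-th kernel column k_{n,i}
```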

Kernel approximation by column sampling. One of the most popular strategies to obtain low space complexity approximations of the kernel is to randomly select a subset of its columns (possibly reweighted) and use them to perform the specific kernel task at hand (e.g., kernel regression). More precisely, we define a column dictionary as a collection $\mathcal{I} = \{(i, w_i)\}_{i=1}^{n}$, where the first term denotes the index of the column and $w_i$ its weight, which is set to zero for all columns that are not retained. For the theoretical analysis, we conveniently keep the dimension of any dictionary equal to $n$, while in practice, we only store the non-zero elements. In particular, we denote by $|\mathcal{I}|$ the size of the dictionary, i.e., the number of elements with non-zero weight $w_i$. Associated with a column dictionary, there is a selection matrix $S \in \mathbb{R}^{n \times n}$ such that for any matrix $A$, the product $AS$ returns a matrix where the columns selected by $\mathcal{I}$ are properly reweighted and all other columns are set to 0. Despite the wide range of kernel applications, it is possible to show that in most of them the quality of a dictionary can be measured in terms of how well it approximates the projection associated with the kernel. In kernel regression, for instance, we use $K_n$ to construct the projection (hat) matrix that maps the observed labels $y$ to the predictions $\widehat{y}$. In particular, let $\Pi_n = K_n K_n^{+}$ be the projection matrix (where $\cdot^{+}$ indicates the pseudoinverse); then $\widehat{y} = \Pi_n y$. If $K_n$ is full-rank, then $\Pi_n$ is the identity matrix and we can reconstruct any target vector $y$ exactly. On the other hand, the only sampling scheme which guarantees to properly approximate a full-rank $\Pi_n$ requires all $n$ columns to be represented in $\mathcal{I}$. In fact, all columns have the same “importance” and no low-space approximation is possible. Nonetheless, kernel matrices are often either rank deficient or have extremely small eigenvalues (exponentially decaying spectrum), as a direct (and desired) consequence of embedding low-dimensional points into a high-dimensional RKHS. In this case, after soft-thresholding the smaller eigenvalues at a given value $\gamma$, $\Pi_n$ can be effectively approximated using a small subset of columns. This is equivalent to approximating the $\gamma$-ridge projection matrix
$$P_n := (K_n + \gamma I_n)^{-1/2}\,K_n\,(K_n + \gamma I_n)^{-1/2} = K_n(K_n + \gamma I_n)^{-1}.$$

We say that a column dictionary is $\varepsilon$-accurate if the following condition is satisfied.

Definition 1.

A dictionary $\mathcal{I}_n$ and its associated selection matrix $S_n$ are $\varepsilon$-accurate w.r.t. a kernel matrix $K_n$ if²

$$\left\|P_n - \widetilde{P}_n\right\| \leq \varepsilon, \qquad (1)$$

where, for a given $\gamma > 0$, the approximated projection matrix is defined as

$$\widetilde{P}_n := (K_n + \gamma I_n)^{-1/2}\,K_n^{1/2} S_n S_n^{\top} K_n^{1/2}\,(K_n + \gamma I_n)^{-1/2}.$$

²The matrix norm we use is the operator (induced) norm.

Notice that this definition of accuracy is purely theoretical, since $\widetilde{P}_n$ is never computed explicitly. Nonetheless, as illustrated in Sect. 5, $\varepsilon$-accurate dictionaries can be used to construct suitable kernel approximations in a wide range of problems.

Ridge leverage scores sampling. Alaoui and Mahoney [1] showed that an $\varepsilon$-accurate dictionary can be obtained by sampling columns proportionally to their $\gamma$-ridge leverage scores (RLS), defined as follows.

Definition 2.

Given a kernel matrix $K_n$, the $\gamma$-ridge leverage score (RLS) of column $i \in [n]$ is

$$\tau_{n,i}(\gamma) := e_{n,i}^{\top} K_n (K_n + \gamma I_n)^{-1} e_{n,i}. \qquad (2)$$

Furthermore, the effective dimension of the kernel matrix is defined as

$$d_{\mathrm{eff}}(\gamma)_n := \sum_{i=1}^{n} \tau_{n,i}(\gamma) = \mathrm{Tr}\!\left(K_n(K_n + \gamma I_n)^{-1}\right). \qquad (3)$$

The RLS can be interpreted and derived in many ways, and they are well studied [5, 4, 18] in the linear setting (e.g., $\varphi(x) = x$). Patel et al. [11] used them as a measure of incoherence to select important points, but their deterministic algorithm provides guarantees only when $K_n$ is exactly low-rank. Here we notice that

$$\tau_{n,i}(\gamma) = \left[K_n(K_n + \gamma I_n)^{-1}\right]_{ii} = [P_n]_{ii},$$

which means that the RLS correspond to the diagonal elements of the $\gamma$-ridge projection matrix $P_n$ itself. Intuitively, selecting each column $i$ with probability proportional to $\tau_{n,i}(\gamma)$ will capture the most important columns needed to define $P_n$, thus minimizing the approximation error $\|P_n - \widetilde{P}_n\|$. More formally, Alaoui and Mahoney [1] state the following.

Proposition 1.

Let $\varepsilon > 0$ and let $\mathcal{I}_n$ be the dictionary built by randomly selecting $m$ columns proportionally to their RLS $\tau_{n,i}(\gamma)$, each with weight $w_i = 1/(m\,p_i)$ where $p_i = \tau_{n,i}(\gamma)/\sum_j \tau_{n,j}(\gamma)$. If $m = \mathcal{O}\!\left(\frac{1}{\varepsilon^2}\, d_{\mathrm{eff}}(\gamma)_n \log\frac{n}{\delta}\right)$, then w.p. at least $1-\delta$, the corresponding dictionary is $\varepsilon$-accurate.
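The following sketch (ours) makes Definition 2 and Proposition 1 concrete: it computes exact RLS and the effective dimension, and samples a dictionary proportionally to the RLS. The Gaussian kernel, bandwidth, and the oversampling constant are arbitrary illustrative choices.

```python
import numpy as np

def exact_rls(K, gamma):
    """Exact gamma-ridge leverage scores and effective dimension (O(n^3) time, O(n^2) space)."""
    n = K.shape[0]
    P = np.linalg.solve(K + gamma * np.eye(n), K)  # (K + gamma I)^{-1} K = K (K + gamma I)^{-1}
    tau = np.diag(P).copy()                        # tau_{n,i}(gamma) = [P_n]_{ii}
    return tau, tau.sum()                          # d_eff(gamma)_n = trace of P_n

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)  # Gaussian kernel matrix

tau, d_eff = exact_rls(K, gamma=1.0)
# Proposition 1 (illustration): sample m = O(d_eff) columns proportionally to their RLS.
m = int(np.ceil(5 * d_eff))                        # the constant 5 is an arbitrary choice
cols = rng.choice(len(tau), size=m, replace=True, p=tau / tau.sum())
```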

Unfortunately, computing exact RLS requires storing $K_n$, and this is seldom possible in practice. In the next section, we introduce SQUEAK, an RLS-based incremental algorithm able to preserve the same accuracy as Prop. 1 without requiring the RLS to be known in advance. We prove that it generates a dictionary only a constant factor larger than the one obtained by exact RLS sampling.

3 Sequential RLS Sampling

0:  Dataset $\mathcal{D}_n$, parameters $\varepsilon$, $\gamma$, $\delta$
0:  $\mathcal{I}_n$
1:  Initialize $\mathcal{I}_0$ as empty, set $\overline{q}$ (see Thm. 1)
2:  for $t = 1, \dots, n$ do
3:     Read point $x_t$ from $\mathcal{D}_n$
4:     $\mathcal{I}'_t = \mathcal{I}_{t-1} \cup \{(x_t, 1, \overline{q})\}$   (Expand)
5:     $\mathcal{I}_t = \text{Dict-Update}(\mathcal{I}'_t)$ using Eq. 4   (Shrink)
6:  end for
Algorithm 1 The SQUEAK algorithm
0:  $\mathcal{I}'_t$
0:  $\mathcal{I}_t$
1:  Initialize $\mathcal{I}_t = \emptyset$
2:  for all $(x_i, \widetilde{p}_{t-1,i}, q_{t-1,i}) \in \mathcal{I}'_t$ do   (Shrink)
3:     if $q_{t-1,i} \neq 0$ then
4:        Compute $\widetilde{\tau}_{t,i}$ using Eq. 4
5:        Set $\widetilde{p}_{t,i} = \min\{\widetilde{\tau}_{t,i},\, \widetilde{p}_{t-1,i}\}$
6:        Set $q_{t,i} \sim \mathcal{B}\!\left(q_{t-1,i},\, \widetilde{p}_{t,i}/\widetilde{p}_{t-1,i}\right)$
7:     else
8:        $\widetilde{p}_{t,i} = \widetilde{p}_{t-1,i}$ and $q_{t,i} = 0$
9:     end if
10:  end for
Subroutine 1 The Dict-Update algorithm

In the previous section, we showed that sampling proportionally to the RLS leads to a dictionary such that $\|P_n - \widetilde{P}_n\| \leq \varepsilon$. Furthermore, since the RLS correspond to the diagonal entries of $P_n$, an accurate approximation of $P_n$ may be used in turn to compute accurate estimates of the RLS. The SQUEAK algorithm (Alg. 1) builds on this intuition to sequentially process the kernel matrix, so that exact RLS computed on a small matrix ($K_t$ with small $t$) are used to create an $\varepsilon$-accurate dictionary, which is then used to estimate the RLS for bigger kernels, which are in turn used to update the dictionary, and so on. While SQUEAK shares a similar structure with INK-Estimate [3], the sampling probabilities are computed from different estimates of the RLS and no renormalization by an estimate of $d_{\mathrm{eff}}(\gamma)$ is needed. Before giving the details of the algorithm, we redefine a dictionary as a collection $\mathcal{I} = \{(x_i, \widetilde{p}_i, q_i)\}$, where $x_i$ is the point stored in the dictionary, $\widetilde{p}_i$ tracks the probability used to sample it, and $q_i$ is the number of copies (multiplicity) of $x_i$. The weights are then computed as $w_i = q_i/(\overline{q}\,\widetilde{p}_i)$, where $\overline{q}$ is an algorithmic parameter discussed later. We use the notation $\widetilde{p}_i$ to stress the fact that these probabilities are computed as approximations of the actual probabilities that should be used to sample each point, i.e., their RLS $\tau_i$.
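For concreteness, a dictionary entry and its induced weight could be represented as follows; this is a minimal sketch with names of our choosing (q_bar stands for the oversampling parameter $\overline{q}$), not a data structure prescribed by the paper.

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    x: object        # the stored point x_i (e.g., a numpy array)
    p_tilde: float   # probability used to sample it (approximate RLS)
    q: int           # number of copies (multiplicity)

    def weight(self, q_bar: int) -> float:
        """w_i = q_i / (q_bar * p_tilde_i); zero once the point has been dropped."""
        return self.q / (q_bar * self.p_tilde)
```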

SQUEAK receives as input a dataset $\mathcal{D}_n$ and processes it sequentially. Starting with an empty dictionary $\mathcal{I}_0$, at each time step $t$, SQUEAK receives a new point $x_t$. Adding a new point to the kernel matrix can either decrease the importance of the points observed before (i.e., if they are correlated with the new point) or leave it unchanged (i.e., if their corresponding kernel columns are orthogonal), and thus for any $i \leq t$, the RLS evolve as follows.

Lemma 1.

For any kernel matrix $K_{t-1}$ at time $t-1$ and its extension $K_t$ at time $t$, we have that the RLS are monotonically decreasing and the effective dimension is monotonically increasing,

$$\tau_{t,i}(\gamma) \leq \tau_{t-1,i}(\gamma) \;\; \text{for all } i \leq t-1, \qquad d_{\mathrm{eff}}(\gamma)_{t-1} \leq d_{\mathrm{eff}}(\gamma)_t.$$

The previous lemma also shows that the RLS cannot decrease too quickly: since $\tau_{t,i}(\gamma) \geq \tau_{t-1,i}(\gamma)/(1 + \mathcal{K}(x_t,x_t)/\gamma)$, they can at most halve when $\mathcal{K}(x_t,x_t) \leq \gamma$. After receiving the new point $x_t$, we need to update our dictionary to reflect the changes of the RLS. We proceed in two phases. During the Expand phase, we directly add the new element to $\mathcal{I}_{t-1}$ and obtain a temporary dictionary $\mathcal{I}'_t$, where the new element is added with sampling probability $1$ and a number of copies $\overline{q}$, i.e., $\mathcal{I}'_t = \mathcal{I}_{t-1} \cup \{(x_t, 1, \overline{q})\}$. This increases our memory usage, forcing us to update the dictionary using Dict-Update in order to decrease its size. Given $\mathcal{I}'_t$ as input, we use the following estimator to compute the approximate RLS $\widetilde{\tau}_{t,i}$,

$$\widetilde{\tau}_{t,i} = \frac{1-\varepsilon}{\gamma}\left([K_t]_{ii} - k_{t,i}^{\top}\overline{S}_t\left(\overline{S}_t^{\top}K_t\overline{S}_t + \gamma I\right)^{-1}\overline{S}_t^{\top}k_{t,i}\right), \qquad (4)$$

where $\varepsilon$ is the accuracy parameter, $\gamma$ is the regularization, and $\overline{S}_t$ is the selection matrix associated to $\mathcal{I}'_t$. This estimator follows naturally from a reformulation of the RLS. In particular, if we consider $\phi_i = \varphi(x_i)$, the RKHS representation of $x_i$, the RLS can be formulated as $\tau_{t,i}(\gamma) = \phi_i^{\top}(\Phi_t\Phi_t^{\top} + \gamma I)^{-1}\phi_i$, where we see that the importance of point $x_i$ is quantified by how orthogonal (in the RKHS) it is w.r.t. the other points. Because we do not have access to all the columns, similarly to what [4] did for the special case $\varphi(x) = x$, we choose to use $\Phi_t\overline{S}_t\overline{S}_t^{\top}\Phi_t^{\top}$ in place of $\Phi_t\Phi_t^{\top}$, and then we use the kernel trick to derive a form that we can actually compute, resulting in Eq. 4. The approximate RLS are then used to define the new sampling probabilities as $\widetilde{p}_{t,i} = \min\{\widetilde{\tau}_{t,i},\, \widetilde{p}_{t-1,i}\}$. For each element in $\mathcal{I}'_t$, the Shrink step draws a sample $q_{t,i}$ from the binomial $\mathcal{B}(q_{t-1,i},\, \widetilde{p}_{t,i}/\widetilde{p}_{t-1,i})$, where the minimum taken in the definition of $\widetilde{p}_{t,i}$ ensures that the binomial probability is well defined (i.e., at most 1). This resampling step basically tracks the changes in the RLS and constructs a new dictionary $\mathcal{I}_t$, which is (with high probability) as if it was created from scratch using all the RLS up to time $t$. We see that the new element $x_t$ is only added to the dictionary with a large number of copies (from 0 to $\overline{q}$) if its estimated relevance is high, and that over time elements originally in $\mathcal{I}_{t-1}$ are stochastically reduced to reflect the reductions of the RLS. The lower $\widetilde{p}_{t,i}$ w.r.t. $\widetilde{p}_{t-1,i}$, the lower the number of copies $q_{t,i}$ w.r.t. $q_{t-1,i}$. If the probability continues to decrease over time, then $q_{t,i}$ may become zero, and the column is completely dropped from the dictionary (by setting its weight to zero).
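The following sketch (ours) illustrates how the estimator of Eq. 4 and the Shrink resampling could be implemented; the data layout (index list plus square-root weights), the Gaussian kernel, and all names are illustrative assumptions rather than the paper's reference code.

```python
import numpy as np

def K_fn(A, B):
    """Any kernel; here a Gaussian kernel with unit bandwidth."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2.0)

def approx_rls(X, i, dict_idx, dict_sqrt_w, gamma, eps):
    """Estimate tau_{t,i} (Eq. 4) using only kernel evaluations against dictionary points.

    dict_idx / dict_sqrt_w encode the selection matrix: each selected column is
    rescaled by the square root of its dictionary weight.
    """
    Xd = X[dict_idx]
    k_ii = K_fn(X[i:i + 1], X[i:i + 1])[0, 0]                    # [K_t]_{ii}
    s_k_i = dict_sqrt_w * K_fn(Xd, X[i:i + 1]).ravel()           # S^T k_{t,i}
    s_K_s = dict_sqrt_w[:, None] * K_fn(Xd, Xd) * dict_sqrt_w[None, :]   # S^T K_t S
    A = s_K_s + gamma * np.eye(len(dict_idx))
    return (1.0 - eps) / gamma * (k_ii - s_k_i @ np.linalg.solve(A, s_k_i))

def shrink(q_prev, p_prev, tau_tilde, rng):
    """Shrink step for one dictionary element: resample its multiplicity."""
    if q_prev == 0:
        return 0, p_prev
    p_new = min(tau_tilde, p_prev)                  # probabilities can only decrease
    q_new = rng.binomial(q_prev, p_new / p_prev)    # stochastically drop copies
    return q_new, p_new

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
dict_idx, dict_sqrt_w = np.array([0, 1, 2]), np.ones(3)
tau_new = approx_rls(X, 5, dict_idx, dict_sqrt_w, gamma=1.0, eps=0.1)
```

Note that only kernel evaluations against the points currently in the dictionary are needed, never a full column of $K_t$. The approximate RLS enjoy the following guarantees.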

Lemma 2.

Given an $\varepsilon$-accurate dictionary $\mathcal{I}_{t-1}$ of matrix $K_{t-1}$, construct $\mathcal{I}'_t$ by adding element $(x_t, 1, \overline{q})$ to it, and compute the selection matrix $\overline{S}_t$. Then for all $i$ in $\mathcal{I}'_t$ such that $q_{t-1,i} \neq 0$, the estimator in Eq. 4 is $\alpha$-accurate, i.e., it satisfies $\tau_{t,i}(\gamma)/\alpha \leq \widetilde{\tau}_{t,i} \leq \tau_{t,i}(\gamma)$, with $\alpha = (1+\varepsilon)/(1-\varepsilon)$. Moreover, given RLS $\tau_{t-1,i}$ and $\tau_{t,i}$, and two $\alpha$-accurate RLS estimates $\widetilde{\tau}_{t-1,i}$ and $\widetilde{\tau}_{t,i}$, the quantity $\min\{\widetilde{\tau}_{t-1,i},\, \widetilde{\tau}_{t,i}\}$ is also an $\alpha$-accurate estimate of $\tau_{t,i}$.

This result is based on the property that whenever $\mathcal{I}_{t-1}$ is $\varepsilon$-accurate for $K_{t-1}$, the projection matrix $P_t$ can be approximated using the temporary dictionary $\mathcal{I}'_t$, and thus the RLS can be accurately estimated and used to update $\mathcal{I}'_t$ into a new $\varepsilon$-accurate dictionary $\mathcal{I}_t$ for $K_t$. Since $\widetilde{p}_{t,i}$ is used to sample the new dictionary $\mathcal{I}_t$, we need each point to be sampled almost as frequently as with the true RLS $\tau_{t,i}$, which is guaranteed by the lower bound of Lem. 2. Since RLS are always smaller than or equal to 1, this could be trivially achieved by setting $\widetilde{p}_{t,i}$ to 1. Nonetheless, this would keep all columns in the dictionary. Consequently, we need to force the RLS estimate to decrease as much as possible, so that low probabilities allow reducing the space as much as possible. This is obtained by the upper bound in Lem. 2, which guarantees that the estimated RLS are always smaller than the exact RLS. As a result, Shrink sequentially preserves the overall accuracy of the dictionary and at the same time keeps its size as small as possible, as shown in the following theorem.

Theorem 1.

Let $\varepsilon$ be the accuracy parameter, $\gamma$ the regularization, and $\delta$ the probability of failure. Given an arbitrary dataset $\mathcal{D}_n$ in input together with parameters $\varepsilon$, $\gamma$, and $\delta$, we run SQUEAK with

$$\overline{q} = \mathcal{O}\!\left(\frac{\alpha}{\varepsilon^2}\log\frac{n}{\delta}\right),$$

where $\alpha = (1+\varepsilon)/(1-\varepsilon)$. Then, w.p. at least $1-\delta$, SQUEAK generates a sequence of random dictionaries $\{\mathcal{I}_t\}_{t=1}^{n}$ that are $\varepsilon$-accurate (Eq. 1) w.r.t. any of the intermediate kernels $K_t$, and the size of the dictionaries is bounded as

$$\max_{t} |\mathcal{I}_t| = \mathcal{O}\!\left(\overline{q}\, d_{\mathrm{eff}}(\gamma)_n\right).$$

As a consequence, on a successful run the overall complexity of SQUEAK is bounded as

$\mathcal{O}\!\left(\max_t |\mathcal{I}_t|^2\right)$ space complexity,
$\mathcal{O}\!\left(n \max_t |\mathcal{I}_t|^3\right)$ time complexity.

We show later that Thm. 1 is a special case of Thm. 2 and give a sketch of the proof with the statement of Thm. 2. We postpone the discussion of this result and the comparison with previous results to Sect. 6 and focus now on the space and time complexity. Note that while the dictionaries always contain $n$ elements for notational convenience, Shrink actually never updates the probabilities of the elements with $q_{t,i} = 0$. This feature is particularly important, since at any step $t$, it only requires computing approximate RLS for the elements which are actually included in $\mathcal{I}_{t-1}$ and the new point (i.e., the elements in $\mathcal{I}'_t$) and thus it does not require recomputing the RLS of points (with $q_{t,i} = 0$) that have been dropped before! This is why SQUEAK computes an $\varepsilon$-accurate dictionary with a single pass over the dataset. Furthermore, the estimator in Eq. 4 does not require computing the whole kernel column $k_{t,i}$ of dimension $t$. In fact, the components of $k_{t,i}$ corresponding to points which are no longer in $\mathcal{I}'_t$ are directly set to zero when computing $\overline{S}_t^{\top}k_{t,i}$. As a result, for any new point we need to evaluate $\mathcal{K}(x_t, x_i)$ only for the indices $i$ in $\mathcal{I}'_t$. Therefore, SQUEAK never performs more than $n \max_t |\mathcal{I}_t|$ kernel evaluations, which means that it does not even need to observe large portions of the kernel matrix. Finally, the runtime is dominated by the matrix inversions used in Eq. 4. Therefore, the total runtime is $\mathcal{O}(n \max_t |\mathcal{I}_t|^3)$, i.e., linear in $n$. In the next section, we introduce DISQUEAK, which improves the runtime by independently constructing separate dictionaries in parallel and then merging them recursively to construct a final $\varepsilon$-accurate dictionary.

4 Distributed RLS Sampling

0:  Dataset $\mathcal{D}$, parameters $\varepsilon$, $\gamma$, $\delta$
0:  $\mathcal{I}_{\mathcal{D}}$
1:  Partition $\mathcal{D}$ into $k$ disjoint sub-datasets $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$
2:  Initialize $\mathcal{I}_i = \{(x_j, 1, \overline{q}) : x_j \in \mathcal{D}_i\}$ for each $i$
3:  Build the set $\mathcal{S} = \{\mathcal{I}_1, \dots, \mathcal{I}_k\}$
4:  for $h = 1, 2, \dots$ do
5:     if $|\mathcal{S}| > 1$ then   (Dict-Merge)
6:        Pick two dictionaries $\mathcal{I}_A$, $\mathcal{I}_B$ from $\mathcal{S}$
7:        $\mathcal{I}' = \mathcal{I}_A \cup \mathcal{I}_B$
8:        $\mathcal{I} = \text{Dict-Update}(\mathcal{I}')$ using Eq. 5
9:        Place $\mathcal{I}$ back into $\mathcal{S}$
10:     else
11:        break
12:     end if
13:  end for
14:  Return $\mathcal{I}_{\mathcal{D}}$, the last dictionary in $\mathcal{S}$
Algorithm 2 The distributed SQUEAK algorithm

In this section, we show that a minor change in the structure of SQUEAK allows us to parallelize and distribute the computation of the dictionary over multiple machines, thus reducing its time complexity even further. Besides the computational advantage, a distributed architecture is needed as soon as the input dimension and the number of points are so large that keeping the dataset on a single machine is impossible. Furthermore, distributed processing can reduce contention on bottleneck data sources such as databases or network connections. DIstributed-SQUEAK (DISQUEAK, Alg. 2) partitions $\mathcal{D}$ over multiple machines, and the (small) dictionaries that are generated from different portions of the dataset are integrated in a hierarchical way. The initial dataset is partitioned into $k$ disjoint sub-datasets $\mathcal{D}_1, \dots, \mathcal{D}_k$, and the dictionaries $\mathcal{I}_i$ are initialized simply by placing all samples in $\mathcal{D}_i$ into $\mathcal{I}_i$ with weight 1 and multiplicity $\overline{q}$. Alternatively, if the sub-datasets are too large to fit in memory, we can run SQUEAK to generate the initial dictionaries. The dictionaries are added to a dictionary collection $\mathcal{S}$, and at each step Alg. 2 arbitrarily chooses two dictionaries $\mathcal{I}_A$ and $\mathcal{I}_B$ from $\mathcal{S}$ and merges them. Dict-Merge first combines them into a single dictionary $\mathcal{I}' = \mathcal{I}_A \cup \mathcal{I}_B$ (the equivalent of the Expand phase in SQUEAK) and then Dict-Update is run on the merged dictionary to create an updated dictionary $\mathcal{I}$, which is placed back in the dictionary collection $\mathcal{S}$. This sequence of merges can be represented using a binary merge tree, as in Fig. 1. Since Dict-Merge only takes the two dictionaries as input and does not require any information on the dictionaries in the rest of the tree, separate branches can be run simultaneously on different machines, and only the resulting (small) dictionary needs to be propagated to the parent node for the future Dict-Merge. Unlike in SQUEAK, Dict-Update is run on the union of two distinct dictionaries rather than on one dictionary and a new single point. As a result, we need to derive the “distributed” counterparts of Lemmas 1 and 2 to analyze the behavior of the RLS and the quality of the estimator.

Lemma 3.

Given two disjoint datasets $\mathcal{D}_1$ and $\mathcal{D}_2$, for every $i \in \mathcal{D}_1$, $\tau_{\mathcal{D}_1 \cup \mathcal{D}_2,\,i}(\gamma) \leq \tau_{\mathcal{D}_1,\,i}(\gamma)$, and $\max\{d_{\mathrm{eff}}(\gamma)_{\mathcal{D}_1},\, d_{\mathrm{eff}}(\gamma)_{\mathcal{D}_2}\} \leq d_{\mathrm{eff}}(\gamma)_{\mathcal{D}_1 \cup \mathcal{D}_2} \leq d_{\mathrm{eff}}(\gamma)_{\mathcal{D}_1} + d_{\mathrm{eff}}(\gamma)_{\mathcal{D}_2}$.


While in SQUEAK we were merging an $\varepsilon$-accurate dictionary and a new point, which is equivalent to a perfect, $0$-accurate dictionary, in DISQUEAK both dictionaries used in a merge are only $\varepsilon$-accurate. To balance this change, we introduce a new estimator,

$$\widetilde{\tau}_{i} = \frac{1-2\varepsilon}{\gamma}\left([K]_{ii} - k_{i}^{\top}\overline{S}\left(\overline{S}^{\top}K\overline{S} + \gamma I\right)^{-1}\overline{S}^{\top}k_{i}\right), \qquad (5)$$

where $\overline{S}$ is the selection matrix associated with the temporary dictionary $\mathcal{I}' = \mathcal{I}_A \cup \mathcal{I}_B$ and $K$ is the kernel matrix of the merged datasets. Eq. 5 has guarantees similar to Lem. 2, with only a slightly larger $\alpha$.

Figure 1: Merge tree for Alg. 2 with an arbitrary partitioning and merging scheme.
Lemma 4.

Given two disjoint datasets $\mathcal{D}_1$, $\mathcal{D}_2$ and two $\varepsilon$-accurate dictionaries $\mathcal{I}_1$, $\mathcal{I}_2$, let $\mathcal{I}' = \mathcal{I}_1 \cup \mathcal{I}_2$ and let $\overline{S}$ be the associated selection matrix. Let $K$ be the kernel matrix computed on $\mathcal{D}_1 \cup \mathcal{D}_2$, $k_i$ its $i$-th column, and $\tau_i(\gamma)$ the RLS of column $i$. Then for all $i$ in $\mathcal{I}'$ such that $q_i \neq 0$, the estimator in Eq. 5 is $\alpha$-accurate, i.e., it satisfies $\tau_i(\gamma)/\alpha \leq \widetilde{\tau}_i \leq \tau_i(\gamma)$, with $\alpha = (1+\varepsilon)/(1-2\varepsilon)$.

Given these guarantees, the analysis of DISQUEAK follows steps similar to SQUEAK's. Given $\varepsilon$-accurate dictionaries, we obtain $\alpha$-accurate RLS estimates that can be used to resample all points in $\mathcal{I}'$ and generate a new dictionary that is $\varepsilon$-accurate. To formalize a result equivalent to Thm. 1, we introduce additional notation: we index each node in the merge tree by its height $h$ and position $j$. We denote the dictionary associated to node $(h,j)$ by $\mathcal{I}_{h,j}$ and the collection of all dictionaries available at height $h$ of the merge tree by $\mathcal{S}_h$. We also use $K_{h,j}$ to refer to the kernel matrix constructed from the datasets at the leaves reachable from node $(h,j)$, i.e., from all points present in its subtree. For instance, in Fig. 1 each inner node is associated with a dictionary that is an $\varepsilon$-accurate dictionary of the kernel matrix constructed from the union of the datasets at its descendant leaves (descendant nodes are highlighted in red), and this kernel matrix has dimension equal to the total number of points in those leaves. Theorem 2 summarizes the guarantees for DISQUEAK.

Theorem 2.

Let $\varepsilon$ be the accuracy parameter, $\gamma$ the regularization factor, and $\delta$ the probability of failure. Given an arbitrary dataset $\mathcal{D}$ and a merge tree structure as input together with parameters $\varepsilon$, $\gamma$, and $\delta$, we run DISQUEAK with

$$\overline{q} = \mathcal{O}\!\left(\frac{\alpha}{\varepsilon^2}\log\frac{n}{\delta}\right),$$

where $\alpha = (1+\varepsilon)/(1-2\varepsilon)$. Then, w.p. at least $1-\delta$, DISQUEAK generates a sequence of collections of dictionaries $\mathcal{S}_h$ such that each dictionary $\mathcal{I}_{h,j}$ in $\mathcal{S}_h$ is $\varepsilon$-accurate (Eq. 1) w.r.t. $K_{h,j}$, and at any node of height $h$ the size of the dictionary is bounded as $|\mathcal{I}_{h,j}| = \mathcal{O}\!\left(\overline{q}\, d_{\mathrm{eff}}(\gamma)_{h,j}\right)$. The cumulative (across nodes) space and time requirements of the algorithm depend on the exact shape of the merge tree.

Theorem 2 gives approximation and space guarantees for every node of the tree. In other words, it guarantees that each intermediate dictionary processed by DISQUEAK is both an $\varepsilon$-accurate approximation of the datasets used to generate it, and requires a small space proportional to the effective dimension of the same datasets. From an accuracy perspective, DISQUEAK provides exactly the same guarantees as SQUEAK. Analysing the complexity of DISQUEAK is however more involved, since the order and arguments of the Dict-Update operations are determined by the merge tree. We distinguish between the time and work complexity of a tree by defining the time complexity as the amount of time necessary to compute the final solution, and the work complexity as the total amount of operations carried out by all machines in the tree in order to compute the final solution. We consider two special cases, a fully balanced tree (all inner nodes have either two inner nodes as children or two leaves), and a fully unbalanced tree (all inner nodes have exactly one inner node and one leaf as children). For both cases, we consider trees where each leaf dataset contains a single point $x_i$. In the fully unbalanced tree, we always merge the current dictionary with a new dataset (a single new point) and no Dict-Merge operation can be carried out in parallel. Unsurprisingly, the sequential algorithm induced by this merge tree is strictly equivalent to SQUEAK. Computing a solution in the fully unbalanced tree takes $\mathcal{O}(n \max_{h,j}|\mathcal{I}_{h,j}|^3)$ time, with a total work that is also $\mathcal{O}(n \max_{h,j}|\mathcal{I}_{h,j}|^3)$, as reported in Thm. 1. On the opposite end, the fully balanced tree needs to invert an $|\mathcal{I}_{h,j}|$-dimensional matrix at each layer of the tree, for a total of $\mathcal{O}(\log n)$ layers. Bounding all intermediate effective dimensions with $d_{\mathrm{eff}}(\gamma)_n$ gives a complexity for computing the final solution of $\mathcal{O}(\log(n)\max_{h,j}|\mathcal{I}_{h,j}|^3)$ time, a huge improvement over SQUEAK. Surprisingly, the total work is only about twice that of SQUEAK, since at each layer $h$ we perform $n/2^h$ inversions (on $n/2^h$ machines), and the sum across all layers is $\sum_h n/2^h \leq 2n$. Therefore, we can compute a solution in a much shorter time than SQUEAK, with a comparable amount of work, but at the expense of requiring much more memory across multiple machines, since at layer $h$ the sum $\sum_j |\mathcal{I}_{h,j}|$ can be much larger than $\max_t |\mathcal{I}_t|$. Nonetheless, this is partly alleviated by the fact that each node locally requires only $\mathcal{O}(\max_{h,j}|\mathcal{I}_{h,j}|^2)$ memory.
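As a structural illustration (ours, not the paper's code) of how a balanced merge tree can be processed, the following sketch reduces a list of leaf dictionaries by pairwise merges; `dict_update` stands in for Dict-Update with the estimator of Eq. 5, and the pairing policy is an arbitrary choice.

```python
from typing import Callable, List

Dictionary = list  # a dictionary is a list of (point, probability, multiplicity) tuples

def disqueak_reduce(leaf_dicts: List[Dictionary],
                    dict_update: Callable[[Dictionary], Dictionary]) -> Dictionary:
    """Merge dictionaries pairwise; one while-iteration corresponds to one layer of the tree."""
    level = list(leaf_dicts)
    while len(level) > 1:
        next_level = []
        for j in range(0, len(level) - 1, 2):
            merged = level[j] + level[j + 1]          # Dict-Merge: union of two dictionaries
            next_level.append(dict_update(merged))    # re-estimate RLS and resample (Shrink)
        if len(level) % 2 == 1:                        # an unpaired dictionary is promoted as-is
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

In a distributed deployment, every call to `dict_update` within a layer is independent and can run on its own machine; only the resulting small dictionaries travel up the tree.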

Proof sketch: Although DISQUEAK is conceptually simple, providing guarantees on its space/time complexity and accuracy is far from trivial. The first step in the proof is to carefully decompose the failure event across the whole merge tree into separate failure events for each merge node $(h,j)$, and for each node construct a random process that models how Alg. 2 generates the dictionary $\mathcal{I}_{h,j}$. Notice that these processes are sequential in nature and the various steps (layers in the tree) are not i.i.d. Furthermore, the variance of the process is potentially large and cannot be bounded uniformly. Instead, we take a more refined approach, inspired by Pachocki [10], that 1) uses Freedman’s inequality to treat the predictable quadratic variation, i.e., the variance of the process, as a random object itself, 2) applies a stochastic dominance argument to reduce it to a sum of i.i.d. random variables, and only then 3) applies i.i.d. concentration results to obtain the desired bound.

Method                    Increm.
Exact                     -
Uniform (Bach [2])        No
Alaoui and Mahoney [1]    No
Calandriello et al. [3]   Yes
SQUEAK                    Yes
DISQUEAK                  Yes
RLS-sampling              -

Table 1: Comparison of Nyström methods. $\lambda_{\max}$ and $\lambda_{\min}$ refer to the largest and smallest eigenvalues of $K_n$.

5 Applications

In this section, we show how our approximation guarantees translate into guarantees for typical kernel methods. As an example, we use kernel ridge regression. We begin by showing how to obtain an accurate approximation of $K_n$ from an $\varepsilon$-accurate dictionary.

Lemma 5.

Given an $\varepsilon$-accurate dictionary $\mathcal{I}_n$ of matrix $K_n$ and the associated selection matrix $S_n$, the regularized Nyström approximation of $K_n$ is defined as

$$\widetilde{K}_n := K_n S_n\left(S_n^{\top}K_n S_n + \gamma I_n\right)^{-1}S_n^{\top}K_n, \qquad (6)$$

and satisfies

$$0 \preceq K_n - \widetilde{K}_n \preceq \frac{\gamma}{1-\varepsilon}\,P_n \preceq \frac{\gamma}{1-\varepsilon}\,I_n. \qquad (7)$$

This is not the only choice of an approximation based on the dictionary $\mathcal{I}_n$. For instance, Musco and Musco [9] show a similar result for an unregularized Nyström approximation, and Rudi et al. [14], for a smaller space complexity, construct the estimator using only the points in $\mathcal{I}_n$. Let $C := K_n S_n$ and $B := S_n^{\top}K_n S_n + \gamma I_n$, so that $\widetilde{K}_n = CB^{-1}C^{\top}$, let $\mu$ be the KRR regularization, and using the Woodbury formula define the regression weights as

$$\widetilde{w} = \left(\widetilde{K}_n + \mu I_n\right)^{-1}y = \frac{1}{\mu}\left(y - C\left(\mu B + C^{\top}C\right)^{-1}C^{\top}y\right). \qquad (8)$$

Computing $B$ and $C^{\top}C$, inverting $\mu B + C^{\top}C$, and the other matrix–matrix multiplications take $\mathcal{O}(n|\mathcal{I}_n|^2 + |\mathcal{I}_n|^3)$ time, and require storing at most an $n \times |\mathcal{I}_n|$ matrix. Therefore the final complexity of computing $\widetilde{w}$ is reduced from $\mathcal{O}(n^3)$ to $\mathcal{O}(n|\mathcal{I}_n|^2 + |\mathcal{I}_n|^3)$ time, and from $\mathcal{O}(n^2)$ to $\mathcal{O}(n|\mathcal{I}_n|)$ space. We now provide guarantees for the empirical risk of $\widetilde{w}$ in a fixed design setting.
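A minimal sketch (ours) of Eq. 6 and Eq. 8, assuming the dictionary is given as a list of column indices with the corresponding square-root weights; the names and the synthetic data are illustrative.

```python
import numpy as np

def nystrom_krr_weights(K, y, dict_idx, dict_sqrt_w, gamma, mu):
    """Approximate KRR weights w_tilde = (K_tilde + mu I)^{-1} y via the Woodbury identity.

    Only the columns K[:, dict_idx] are ever touched, so the full kernel matrix
    never needs to be materialized beyond the dictionary columns.
    """
    C = K[:, dict_idx] * dict_sqrt_w[None, :]                      # C = K_n S_n   (n x m)
    B = (dict_sqrt_w[:, None] * K[np.ix_(dict_idx, dict_idx)]
         * dict_sqrt_w[None, :] + gamma * np.eye(len(dict_idx)))   # B = S^T K S + gamma I  (m x m)
    inner = mu * B + C.T @ C                                       # (m x m) system instead of (n x n)
    return (y - C @ np.linalg.solve(inner, C.T @ y)) / mu          # Eq. 8

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
dict_idx = rng.choice(300, size=40, replace=False)                 # stand-in for a SQUEAK dictionary
dict_sqrt_w = np.ones(40)                                          # uniform weights, for illustration only
w = nystrom_krr_weights(K, y, dict_idx, dict_sqrt_w, gamma=1.0, mu=0.1)
```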

Corollary 1 ([1, Thm. 3]).

For an arbitrary dataset $\mathcal{D}_n$, let $K_n$ be the kernel matrix constructed on $\mathcal{D}_n$. Run SQUEAK or DISQUEAK with regularization parameter $\gamma$. Then, the solution $\widetilde{w}$ computed using the regularized Nyström approximation $\widetilde{K}_n$ satisfies

$$\widehat{\mathcal{R}}(\widetilde{w}) \leq \left(1 + \frac{\gamma}{\mu(1-\varepsilon)}\right)^{2}\widehat{\mathcal{R}}(\widehat{w}),$$

where $\mu$ is the regularization of kernel ridge regression, $\widehat{w}$ is the exact KRR solution, and $\widehat{\mathcal{R}}$ is the empirical risk on $\mathcal{D}_n$.

Using tools from [14], and under some mild assumptions on the kernel function and the dataset $\mathcal{D}_n$, similar results can be derived for the random design setting.

Other applications. The projection matrix $P_n$ naturally appears in some form across nearly all kernel-based methods. Therefore, in addition to KRR in the fixed [1] and random [14] design settings, any kernel matrix approximation that provides $\varepsilon$-accuracy guarantees on $P_n$ can be used to provide guarantees for a variety of other kernel problems. As an example, Musco and Musco [9] show this is the case for kernel PCA [15], kernel CCA with regularization, and kernel $k$-means clustering. Similarly, inducing-points methods for Gaussian Processes [12, 13] can benefit from fast and provably accurate dictionary construction.

6 Discussion

Tab. 1 compares several kernel approximation methods w.r.t. the time complexity required to compute an $\varepsilon$-accurate dictionary, as well as the size of the final dictionary $\mathcal{I}_n$. Note that for all methods the final complexity to construct an approximation $\widetilde{K}_n$ from $\mathcal{I}_n$ (e.g., using Eq. 6) scales as $\mathcal{O}(n|\mathcal{I}_n|^2)$ time and $\mathcal{O}(n|\mathcal{I}_n|)$ space. For all methods, we omit logarithmic factors. We first report RLS-sampling, a fictitious algorithm that receives the exact RLS as input, as an ideal baseline for all RLS sampling algorithms. The space complexity of uniform sampling [2] scales with the maximum degree of freedom $d_{\max}$ of the kernel matrix. Since $d_{\mathrm{eff}}(\gamma)_n \leq d_{\max}$, uniform sampling is often outperformed by RLS sampling. Moreover, uniform sampling needs to know $d_{\max}$ in advance to guarantee $\varepsilon$-accuracy, and this quantity is expensive to compute [18]. While Alaoui and Mahoney [1] also sample according to RLS, their two-pass estimator does not preserve the same level of accuracy. In particular, the first pass requires sampling a number of columns that scales with the inverse of the smallest eigenvalue $\lambda_{\min}$ of $K_n$, which quickly grows above $d_{\mathrm{eff}}(\gamma)_n$ when $\lambda_{\min}$ becomes small. Finally, Calandriello et al. [3] require that the maximum dictionary size is fixed in advance, which implies some prior knowledge of the effective dimension $d_{\mathrm{eff}}(\gamma)_n$, and requires estimating both the RLS and $d_{\mathrm{eff}}(\gamma)_n$. This extra estimation effort causes an additional factor, related to the largest eigenvalue $\lambda_{\max}$ of $K_n$, to appear in the space complexity. This factor cannot be easily estimated, and it leads to a space complexity of $\mathcal{O}(n)$ in the worst case. Therefore, we can see from the table that SQUEAK achieves the same space complexity (up to constant factors) as knowing the RLS in advance and hence outperforms previous methods. Moreover, when parallelized across $k$ machines, DISQUEAK can reduce this linear runtime by a factor of $k$ (linear scaling) with little communication cost.

A recent method by Musco and Musco [9] achieves comparable space and time guarantees to SQUEAK.³ While they rely on a similar estimator, the two approaches are very different. Their method is batch in nature, estimating leverage scores using repeated independent sampling from the whole dataset, and it requires multiple passes over the data. On the other hand, SQUEAK is intrinsically sequential and it only requires a single pass over $\mathcal{D}_n$, as points are “forgotten” once they are dropped from the dictionary. Furthermore, the different structure requires remarkably different tools for the analysis. While the method of [9] can directly use i.i.d. concentration inequalities (at the price of several passes), we need to rely on more sophisticated martingale arguments to handle the sequential stochastic process of SQUEAK. Furthermore, Musco and Musco [9] require centralized coordination after each of the sampling passes, and cannot leverage distributed architectures to match DISQUEAK’s sub-linear runtime.

³The technical report of Musco and Musco [9] was developed independently from our work.

Future developments. Both SQUEAK and DISQUEAK need to know the size $n$ of the dataset in advance to tune $\overline{q}$. An interesting question is whether it is possible to adaptively adjust this parameter at runtime. This would allow us to continue updating the dictionary and indefinitely process new data beyond the initial dataset $\mathcal{D}_n$. It is also interesting to see whether SQUEAK could be used in conjunction with existing meta-algorithms (e.g., [7] with model averaging) for kernel matrix approximation that can leverage an accurate sampling scheme as a black box, and what kind of improvements we could obtain.

Acknowledgements

The research presented was supported by the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).

References

  • Alaoui and Mahoney [2015] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.
  • Bach [2013] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.
  • Calandriello et al. [2016] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Analysis of Nyström method with sequential ridge leverage scores. In Uncertainty in Artificial Intelligence, 2016.
  • Cohen et al. [2016] Michael B. Cohen, Cameron Musco, and Jakub Pachocki. Online row sampling. International Workshop on Approximation, Randomization, and Combinatorial Optimization, 2016.
  • Cohen et al. [2017] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Symposium on Discrete Algorithms, 2017.
  • Gittens and Mahoney [2013] Alex Gittens and Michael W Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, 2013.
  • Kumar et al. [2012] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13(Apr):981–1006, 2012.
  • Levy [2015] Haim Levy. Stochastic dominance: Investment decision making under uncertainty. Springer, 2015.
  • Musco and Musco [2016] Cameron Musco and Christopher Musco. Provably useful kernel matrix approximation in linear time. Technical report, 2016. URL http://arxiv.org/abs/1605.07583.
  • Pachocki [2016] Jakub Pachocki. Analysis of resparsification. Technical report, 2016. URL http://arxiv.org/abs/1605.08194.
  • Patel et al. [2015] Raajen Patel, Thomas A. Goldstein, Eva L. Dyer, Azalia Mirhoseini, and Richard G. Baraniuk. oASIS: Adaptive column sampling for kernel matrix approximation. Technical report, 2015. URL http://arxiv.org/abs/1505.05208.
  • Quiñonero-Candela and Rasmussen [2005] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005. URL http://www.jmlr.org/papers/v6/quinonero-candela05a.html.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9. OCLC: ocm61285753.
  • Rudi et al. [2015] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Neural Information Processing Systems, 2015.
  • Schölkopf et al. [1999] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Advances in kernel methods, pages 327–352. MIT Press Cambridge, MA, USA, 1999.
  • Tropp [2011] Joel Aaron Tropp. Freedman’s inequality for matrix martingales. Electronic Communications in Probability, 16:262–270, 2011.
  • Tropp [2015] Joel Aaron Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
  • Woodruff et al. [2014] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
Notation summary
$K_t$: kernel matrix at time $t$, with $i$-th column $k_{t,i}$ and $(i,j)$-th entry $[K_t]_{ij} = \mathcal{K}(x_i, x_j)$
$K_t = U_t \Lambda_t U_t^{\top}$: eigendecomposition, with eigenvector matrix $U_t$ and diagonal PSD eigenvalue matrix $\Lambda_t$
$\Phi_t\Phi_t^{\top}$: kernel (covariance) operator at time $t$ in the RKHS, with $\Phi_t = [\phi_1, \dots, \phi_t]$ and $K_t = \Phi_t^{\top}\Phi_t$
$\Phi_t = V_t \Sigma_t U_t^{\top}$: SVD decomposition, with left singular vectors $V_t$, right singular vectors $U_t$ (the same as the eigenvectors of $K_t$), and singular values $\Sigma_t$ (with the singular values on the main diagonal and zeros under it)
$P_t = K_t(K_t + \gamma I_t)^{-1}$: $\gamma$-ridge projection matrix at time $t$
$\mathcal{I}_t = \{(x_i, \widetilde{p}_{t,i}, q_{t,i})\}$: column dictionary at time $t$
$S_t$: selection matrix associated with $\mathcal{I}_t$, with weights $w_{t,i} = q_{t,i}/(\overline{q}\,\widetilde{p}_{t,i})$