Nyström landmark sampling and regularized Christoffel functions


M. Fanuel (michael.fanuel@kuleuven.be), J. Schreurs, J.A.K. Suykens
KU Leuven, Department of Electrical Engineering (ESAT),
STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics,
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Abstract

Selecting diverse and important items from a large set is a problem of interest in machine learning. As a specific example, in order to deal with large training sets, kernel methods often rely on low rank matrix approximations based on the selection or sampling of Nyström centers. In this context, we propose a deterministic and a randomized adaptive algorithm for selecting landmark points within a training dataset, which are related to the minima of a sequence of Christoffel functions in Reproducing Kernel Hilbert Spaces. Beyond the known connection between Christoffel functions and leverage scores, a connection of our method with determinantal point processes (DPPs) is also explained. Namely, our construction promotes diversity among important landmark points in a way similar to DPPs.

1 Introduction

Two main approaches for large-scale kernel methods are often considered: random features [1] and the Nyström approximation [2], which we consider more specifically in this paper. Selecting good landmark points is a question of major interest for this method. Let us mention some approaches to this problem. For matrix approximation, the so-called “statistical leverage scores”, which intuitively correspond to the correlation between the singular vectors of a matrix and the canonical basis, are quantities of interest for designing randomized matrix algorithms. In the context of large-scale kernel methods, recent efficient approaches indeed consider leverage score sampling [3, 4, 5], which results in various theoretical bounds on the quality of the Nyström approximation. Several refined methods also involve recursive sampling procedures [6, 7, 8, 9] yielding statistical guarantees. In view of certain applications where random sampling is less convenient, deterministic landmark selection based on leverage scores can also be used, as for instance in [10, 11, 12], whereas greedy approaches have also been proposed [13]. On the other hand, while methods based on leverage scores often select “important” landmarks, it is intuitively appealing to promote samples with diverse landmarks. In the same spirit, the selection of informative data points [14, 15] is an interesting problem which also motivated entropy-based landmark selection methods [16, 17]. Selecting diverse datapoints by using Determinantal Point Processes (DPPs) has been a topic of active research [18, 19]. DPPs are kernel-based probabilistic models of repulsion inspired by physics. While the number of points generated by a DPP is itself random, it is possible to define k-Determinantal Point Processes (k-DPPs) which only generate samples of k points [20]. However, sampling a DPP or a k-DPP exactly requires $\mathcal{O}(n^3)$ operations, and therefore several improvements or approximations of the sampling algorithm have been studied [21, 22, 23]. This paper explains how DPPs and leverage score sampling methods can be connected in the context of Christoffel functions, which were proposed in recent years as a tool for machine learning [24, 25, 26] and are also interestingly connected to the problem of support estimation [27].
Organization of the paper. In Section 2, kernelized Christoffel functions with additional constraints are introduced. This gives a motivation for the algorithms proposed in this paper, in connection with diversity and DPPs. The DAS (Deterministic Adaptive Sampling) and RAS (Randomized Adaptive Sampling) algorithms are analysed in Section 3 and Section 4, respectively. Finally, numerical results are given in Section 5. The proofs of our results are given in the Supplementary Material.
Notations. Let $\{x_1, \dots, x_n\}$ be a data set in $\mathbb{R}^d$. For further use, the canonical basis of $\mathbb{R}^n$ is denoted by $e_1, \dots, e_n$. Let $\mathcal{H}$ be a Reproducing Kernel Hilbert Space (RKHS) with a strictly positive definite kernel $k(x, y)$. Then, it is common to denote $\phi(x) = k(x, \cdot)$ and to use the reproducing property $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. Typical examples of such kernels are given by the Gaussian RBF kernel $k(x, y) = \exp(-\|x - y\|_2^2 / (2\sigma^2))$ or the Laplace kernel $k(x, y) = \exp(-\|x - y\|_2 / \sigma)$. We denote by $S : \mathcal{H} \to \mathbb{R}^n$ the sampling operator, which evaluates a function of $\mathcal{H}$ at the data points, and by $S^{\ast} : \mathbb{R}^n \to \mathcal{H}$ its adjoint. The max norm and the operator norm of a matrix $A$ are given respectively by $\|A\|_{\max} = \max_{ij} |A_{ij}|$ and $\|A\|_2$. Finally, we write $A \succeq 0$ if $A$ is positive semi-definite.

2 Landmark selection and Christoffel functions

2.1 Christoffel functions with additional constraints

In the case of an empirical density, the value of the regularized Christoffel function at a point $x$ is obtained as follows:

(1)

where $f(x_i) = \langle f, \phi(x_i) \rangle_{\mathcal{H}}$ for all $i = 1, \dots, n$. Notice that in the above definition there is an implicit dependence on the regularization parameter $\lambda$, the empirical density and the RKHS itself. As explained in [25], the function (1) is related to the empirical density in the sense that it takes larger values where the dataset is dense and smaller values in sparse regions. This is the intuitive reason why the Christoffel function can be used for outlier detection [24] and is related to the inverse of leverage scores, as we recall hereafter. Hence, the function (1) naturally provides importance scores of datapoints. For many datasets the “importance” of points is not uniform over the dataset, which motivates the use of ridge leverage score sampling (RLS) and our proposed methods (see Figure 1).
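For concreteness, a standard way of writing the regularized Christoffel function associated with the empirical measure, in the spirit of [25], is sketched below; the $1/n$ normalization and the placement of the regularization parameter $\lambda$ are conventional choices and should be read as assumptions rather than as the exact display used in (1).

```latex
% Regularized Christoffel function at a point x (empirical measure).
% The 1/n normalization and the scaling of \lambda are assumed conventions.
C_\lambda(x) \;=\; \min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} f(x_i)^2 \;+\; \lambda \, \|f\|_{\mathcal{H}}^2
\qquad \text{subject to} \qquad f(x) = 1 .
```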

(a) Uniform
(b) Ridge Leverage Score (RLS)
(c) Deterministic Adaptive (DAS)
Figure 1: Illustration of sampling methods on an artificial dataset. Uniform sampling oversamples dense parts. RLS overcomes this limitation, however landmarks can be close to each other. DAS promotes diversity.

We propose to extend this definition in order to provide importance scores of data points which lie in the complement of a given set $\mathcal{C}$ of already selected points. This consists in excluding the points of $\mathcal{C}$, and to some extent also their neighbourhood, in order to look for other important points in the rest of the dataset. Namely, we define a regularized Christoffel function with additional constraints as follows:

(2)

where we restrict the minimization to functions $f \in \mathcal{H}$ which vanish on the points of $\mathcal{C}$, and the value is defined for all $x$ outside of $\mathcal{C}$. By using the representer theorem, we find that there exists a coefficient vector such that the optimal solution is a linear combination of the functions $\phi(x_i)$. In order to rephrase the optimization problem in the RKHS, we introduce the covariance operator and the kernel matrix $K$, which share the same non-zero eigenvalues. Then, the problem (2) reads:

(3)

In other words, the formulation (3) corresponds to the definition (1) where the RKHS $\mathcal{H}$ is replaced by a specific subspace of $\mathcal{H}$, namely the subspace of functions vanishing on $\mathcal{C}$.
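Similarly, a plausible explicit form of the constrained problem (2), under the assumption that the additional constraints simply force the minimizer to vanish on $\mathcal{C}$, is the following sketch.

```latex
% Christoffel function with additional vanishing constraints on the set C.
% Normalization constants are assumed as in the unconstrained sketch above.
C_{\lambda,\mathcal{C}}(x) \;=\; \min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} f(x_i)^2 \;+\; \lambda \, \|f\|_{\mathcal{H}}^2
\qquad \text{s.t.} \qquad f(x) = 1 \ \text{ and } \ f(x_j) = 0 \ \ \text{for all } x_j \in \mathcal{C} .
```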

In this paper, we propose to solve several problems of the type (3) for different sets $\mathcal{C}$ and to select, within the dataset, the landmarks which minimize the corresponding Christoffel function. In order to formulate our main result given in Proposition 1, we introduce the projector kernel matrix, also known as the smoothing kernel:

(4)

which is a regularized form of $K K^{+}$, where $K^{+}$ is the Moore-Penrose pseudo-inverse of $K$. When no ambiguity is possible, we will simply write $P$ to ease the reading.

Denote by $P_{\mathcal{C}}$ and $P_{\mathcal{C}\mathcal{C}}$ the submatrices of the projector kernel obtained by selecting, respectively, the columns and both the rows and columns indexed by $\mathcal{C}$; the corresponding sampling matrix is formed by the columns of the identity matrix indexed by $\mathcal{C}$. We emphasize that the product of a kernel matrix with a sampling matrix does not necessarily require that the kernel matrix is completely stored in memory.
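As an illustration, a minimal NumPy sketch of the projector kernel and of the submatrices used above is given below; it assumes the common form $P = K(K + n\lambda \mathbb{I})^{-1}$ for equation (4), so the exact regularization scaling is an assumption.

```python
import numpy as np

def projector_kernel(K, lam):
    """Projector kernel P = K (K + n*lam*I)^{-1} (assumed form of Eq. (4))."""
    n = K.shape[0]
    # Solve (K + n*lam*I) X = K; by symmetry, X equals K (K + n*lam*I)^{-1}.
    return np.linalg.solve(K + n * lam * np.eye(n), K)

def submatrices(P, C):
    """Columns P[:, C] and principal block P[C, C] for a list of landmark indices C."""
    C = np.asarray(C)
    return P[:, C], P[np.ix_(C, C)]

# Example with a Gaussian kernel on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2.0 * 1.0 ** 2))
P = projector_kernel(K, lam=1e-3)
P_C, P_CC = submatrices(P, [0, 10, 20])
```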

Proposition 1.

Let $\mathcal{H}$ be an RKHS with a strictly positive definite kernel. If $\mathcal{C}$ is not empty, we have:

(5)

where the quantity on the left-hand side is expressed in terms of the optimal value of the problem (2).

The second term in (5) is clearly the Nyström approximation of the projector kernel based on the subset of columns indexed by $\mathcal{C}$. Also, notice that (5) is the posterior predictive variance of a Gaussian process regression associated with the projector kernel given in (4), in the absence of regularization [28]. The accuracy of the Nyström approximation based on the set $\mathcal{C}$ is controlled thanks to the optimal value of the objective (2) in terms of the max norm, as stated in Corollary 1.

Corollary 1.

Let , then we have

From the computational perspective, although $P_{\mathcal{C}\mathcal{C}}$ is invertible since the kernel is assumed to be strictly positive definite, this matrix may still have numerically very small eigenvalues. The inverse appearing in (5) can therefore be regularized in order to avoid numerical errors. Indeed, we can define a regularized form of the low rank Nyström approximation, associated with a sampling matrix obtained by sampling and possibly rescaling the columns of the identity matrix. The hard constraints in the problem (2) can also be changed into soft constraints by adding some weights in the objective. The effect of the soft constraints is given in Proposition 2: in other words, the value of $f$ on the points of $\mathcal{C}$ is severely penalized when the corresponding weights are large.

Proposition 2.

Let and . Define weights as for all and otherwise. Then, it holds that:

where the sampling matrix is obtained by selecting the columns of the identity matrix indexed by $\mathcal{C}$.

It is worth mentioning that the diagonal of $P$ yields the so-called ridge leverage scores (see e.g. [7, 8]), while the trace of the projector kernel matrix is the effective dimension of a least squares regression problem [29]. As a matter of fact, the connection observed in [25] between Christoffel functions and leverage scores corresponds to the unconstrained case where the set $\mathcal{C}$ is empty.
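To make this link explicit, the ridge leverage scores and the effective dimension can be written as follows, again under the assumption $P = K(K + n\lambda \mathbb{I})^{-1}$ (the $n\lambda$ scaling is an assumption).

```latex
% Ridge leverage scores (diagonal of P) and effective dimension (trace of P),
% written with the assumed n\lambda scaling of the regularization.
\ell_i(\lambda) = \big[ K (K + n\lambda \mathbb{I})^{-1} \big]_{ii},
\qquad
d_{\mathrm{eff}}(\lambda) = \operatorname{Tr}\big( K (K + n\lambda \mathbb{I})^{-1} \big)
= \sum_{j=1}^{n} \frac{\sigma_j(K)}{\sigma_j(K) + n\lambda},
```

where $\sigma_1(K) \ge \dots \ge \sigma_n(K)$ are the eigenvalues of $K$.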

2.2 Connection with determinantal processes

Notice that in the particular case where $\mathcal{C}$ reduces to a single point $x_j$, we find the simple expression:

(6)

which motivates the following comment. The above determinant can be interpreted as the probability that a determinantal point process [18, 19] draws a sample including $x_i$ and $x_j$. This DPP is determined by an L-ensemble associated with the projector kernel. Explicitly, the probability that a given set is sampled by the DPP is proportional to the determinant of the corresponding principal submatrix of the L-ensemble. By inspecting (6), we notice that the determinant can be made larger by considering $x_i$ and $x_j$ such that $P_{ij}$ is small in absolute value. Hence, maximizing this quantity over the dataset indeed promotes diversity among the landmarks. These last remarks motivate Proposition 3, which can also be found in [10] in a slightly different setting.
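Concretely, when $\mathcal{C} = \{x_j\}$, the determinant appearing in (6) is expected to be the $2\times 2$ principal minor of $P$ indexed by $i$ and $j$; this reconstruction is consistent with the surrounding discussion, although the exact constant in front of it is not reproduced here.

```latex
\det \begin{pmatrix} P_{ii} & P_{ij} \\ P_{ji} & P_{jj} \end{pmatrix}
\;=\; P_{ii} P_{jj} - P_{ij}^2 ,
```

which is large when both diagonal scores $P_{ii}$ and $P_{jj}$ are large and the off-diagonal similarity $P_{ij}$ is small, that is, when $x_i$ and $x_j$ are both important and mutually diverse.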

Proposition 3 (Determinant formula).

Let $\mathcal{C}$ be a non-empty set and let $x$ be a point outside $\mathcal{C}$; denote the augmented set $\mathcal{C} \cup \{x\}$. The solution of (2) has the form:

(7)

where the submatrix of the projector kernel is obtained by selecting the rows and columns indexed by the corresponding set. Furthermore, if we use the spectral decomposition of the projector kernel matrix, we obtain an alternative expression involving its eigenvectors and the orthogonal projector onto the subspace spanned by the selected columns.

The L-ensemble yields the so-called marginal kernel, which allows us to define inclusion probabilities. In other words, the above determinant can be expressed in terms of a random sample $\mathcal{Y}$ of the determinantal point process with this marginal kernel.
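For completeness, we recall the standard relation between an L-ensemble $L$ and its marginal kernel [18, 19]; the identification of this marginal kernel with the projector kernel of (4), via the choice $L = K/(n\lambda)$, is an assumption consistent with the discussion above.

```latex
% Marginal kernel of an L-ensemble and inclusion probabilities of the DPP.
P \;=\; L (L + \mathbb{I})^{-1},
\qquad
\Pr\big( S \subseteq \mathcal{Y} \big) \;=\; \det\big( P_{SS} \big),
```

where $\mathcal{Y}$ is a random sample of the DPP and $P_{SS}$ is the principal submatrix of $P$ indexed by $S$; with $L = K/(n\lambda)$, one indeed recovers $P = K(K + n\lambda \mathbb{I})^{-1}$.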

3 Deterministic adaptive landmark sampling

By using the analogy with ridge leverage scores, we propose to select landmarks iteratively by finding the minima of Christoffel functions defined such that the additional constraints enforce diversity. As described in Algorithm 1, we successively maximize the selection score for a nested sequence of landmark sets, starting from the empty set and adding one landmark at each iteration.

1: input: Kernel matrix, sample size and regularization.
2: initialization: empty set of landmarks and initial scores.
3: while the number of selected landmarks is smaller than the sample size do
4:   Compute the selection scores using the projector kernel of equation (4).
5:   Find the index attaining the maximum score.
6:   Update the set of landmarks and the scores.
7: end while
8: return the set of selected landmarks.
Algorithm 1 Deterministic adaptive landmark sampling (DAS).

Algorithm 1 is a greedy reduced basis method as defined in [30], which has been used recently for landmarking on manifolds [31] with a different kernel. Clearly, an advantage of working with the projector kernel matrix rather than the kernel matrix is that the linear systems that have to be solved in the adaptive sampling algorithms involve submatrices of $P$, whose largest eigenvalue is bounded by one. Therefore, the condition number of these linear systems is expected a priori to be smaller than the condition number of the analogous systems involving $K$.
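The following NumPy sketch illustrates the greedy selection rule; it assumes that the score maximized at each step is the diagonal of the Schur complement of $P$ with respect to the current landmark set, as suggested by Proposition 1, and it is not the authors' reference implementation.

```python
import numpy as np

def das(P, k, jitter=1e-10):
    """Greedy landmark selection on the projector kernel P (assumed DAS criterion).

    At each step, the point maximizing the diagonal of the Schur complement
    P_ii - P_iC (P_CC)^{-1} P_Ci is selected, i.e. the point with the largest
    residual 'novelty' given the current landmark set C.
    """
    n = P.shape[0]
    C = []
    scores = np.diag(P).copy()           # Schur complement diagonal for C = {}
    for _ in range(k):
        masked = scores.copy()
        masked[C] = -np.inf              # never reselect a landmark
        j = int(np.argmax(masked))
        C.append(j)
        # Column j of the Schur complement with respect to the previous set.
        prev = C[:-1]
        if prev:
            A = P[np.ix_(prev, prev)] + jitter * np.eye(len(prev))
            r = P[:, j] - P[:, prev] @ np.linalg.solve(A, P[prev, j])
        else:
            r = P[:, j].copy()
        # Rank-one downdate of the Schur complement diagonal after adding j.
        scores = scores - r ** 2 / (r[j] + jitter)
    return C
```

The projector kernel can be computed as in the earlier sketch, and the returned indices define the Nyström landmarks.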

Proposition 4 (Convergence of Algorithm 1).

Let the eigenvalues of the projector kernel matrix be sorted in non-increasing order. If the set of landmarks is sampled according to Algorithm 1, we have:

Notice that the leading quantity in the bound is indeed the largest leverage score, related to the maximal marginal degrees of freedom defined in [5]. Furthermore, we remark that the eigenvalues of the projector kernel matrix are obtained by a Tikhonov filtering of the eigenvalues of the kernel matrix. As a consequence, if we assume that the dataset lies within a Euclidean ball and choose the Gaussian RBF kernel, we can expect the decay of these eigenvalues to be exponential provided that the bandwidth is large enough (cf. Theorem 3 in [32] and [33]).
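The Tikhonov filtering mentioned above can be spelled out as follows, again under the assumed scaling $n\lambda$ of the regularization.

```latex
% Eigenvalues of the projector kernel as a Tikhonov filter of those of K.
\sigma_i(P) \;=\; \frac{\sigma_i(K)}{\sigma_i(K) + n\lambda}, \qquad i = 1, \dots, n,
```

so that a fast decay of the spectrum of $K$ below the scale $n\lambda$ translates into a fast decay of the spectrum of $P$.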

In the simulations, we illustrate this remark on datasets where the spectral decay is fast, which seems to be often the case in practice, as explained in [11, 12]. For a given set of landmarks $\mathcal{C}$, Lemma 1 relates the error on the approximation of the kernel matrix to the error on the approximation of the projector kernel.

Lemma 1 (Approximation of the kernel matrix).

Let . Then, it holds that:

with .

This structural result connects the approximation of the kernel matrix to the approximation of the projector kernel, as a function of the regularization parameter. However, the resulting upper bound on the approximation of the kernel matrix can be very loose, since its prefactor can be large in general. In order to obtain more instructive theoretical guarantees, we now study a randomized version of Algorithm 1.

4 Randomized adaptive landmark sampling

The randomized adaptive sampling algorithm is based on Lemma 2 (cf. [7, 8]), which relates a regularized Nyström approximation of the projector kernel to an approximation of the kernel matrix.

Lemma 2.

Let . Let such that and let be a sampling matrix. Then:

(8)

The idea is to leverage Lemma 2 in order to design a random sampling algorithm where the probability of sampling a point depends on the left-hand side of (8), which is closely related to the Christoffel function of Proposition 2. The regularization parameters are typically chosen so that the previously selected landmarks are strongly penalized. Hence, in this circumstance, the adaptivity in Algorithm 2 promotes diversity among the landmarks.

This procedure is then tightly related to ridge leverage score sampling with respect to the projector kernel. Indeed, the composition rule of projector kernels is given in Lemma 3.

Lemma 3.

Let and . Then, .

We can now explain how Lemma 2 can be used to obtain an algorithm yielding a Nyström approximation of the kernel matrix. Indeed, the regularized kernel matrix admits a factorization whose columns can be sampled. Then, thanks to Lemma 4, we know that if we can sample these columns appropriately, the error on the corresponding Nyström approximation will be bounded by a small constant.

Lemma 4 (Kernel approximation [6]).

Let and . If there is a sampling matrix such that , then

In other words, we want to find a set of landmarks and sampling probabilities such that the condition of Lemma 4 holds with a probability which is ‘not too small’. A major difference with [6, 7] is that the landmarks are not sampled independently in Algorithm 2. In view of Lemma 4, we now describe a procedure for constructing an appropriate sampling matrix, i.e., by adaptively adding a column to the sampling matrix obtained at the previous step. Namely, given the sampling matrix at iteration $i$, we define the sampling probability:

(9)

where the parameter is an oversampling factor [7]. Notice that the score (9) can also be expressed in an alternative form thanks to Lemma 8. Then, Algorithm 2 can be seen as a special case of the online-sample algorithm given in [34].

1: input: Kernel matrix, oversampling factor and regularization parameters.
2: initialization: empty sampling matrix.
3: for i = 1, …, n do
4:   Compute the sampling probability of point i using equation (9).
5:   Update the sampling matrix with this probability, and leave it unchanged otherwise.
6: end for
7: return the sampling matrix.
Algorithm 2 Randomized adaptive landmark sampling (RAS).
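A schematic NumPy version of the randomized pass is given below; since the exact expression of the score (9) is not reproduced here, the regularized leverage-type score with respect to the current sample, scaled by the oversampling factor, should be read as an assumption.

```python
import numpy as np

def ras(P, qbar, gamma, seed=0):
    """Randomized adaptive landmark sampling on the projector kernel P (schematic).

    Each point is visited once; its score is a regularized leverage-type quantity
    with respect to the landmarks sampled so far (assumed form of Eq. (9)), and
    the point is kept with probability min(1, qbar * score).
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    S = []                                    # indices of the sampled landmarks
    for i in range(n):
        if S:
            A = P[np.ix_(S, S)] + gamma * np.eye(len(S))
            resid = P[i, i] - P[i, S] @ np.linalg.solve(A, P[S, i])
        else:
            resid = P[i, i]
        p_i = min(1.0, qbar * resid / gamma)  # the 1/gamma scaling is an assumption
        if rng.random() < p_i:
            S.append(i)
    return S
```

Note that the landmarks are not drawn independently: each accepted point lowers the scores of nearby points at later iterations, which is the source of the diversity discussed above.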

Our main result indicates that, for a suitable choice of the oversampling parameter, Algorithm 2 produces a sampling matrix allowing for a ‘good’ Nyström approximation of the kernel matrix with high probability.

Theorem 1 (Main result).

Let and . If the oversampling parameter satisfies:

then, with a probability at least , Algorithm 2 yields a sampling matrix such that it holds that as well as

The exact lower bound on the oversampling parameter in Theorem 1 is given in the supplementary material in terms of the negative branch of the Lambert W function [35], whose asymptotic behaviour is logarithmic. The proof of Theorem 1 uses the martingale-based techniques developed in [34] in the context of online sampling, together with a refined Freedman-type inequality [36, 37] for matrix martingales. An advantage of the Freedman inequality obtained in [36] is that it does not directly depend on the dimension of the matrices but rather on a certain form of ‘intrinsic’ dimension. The martingale-based techniques allow us to deal with the adaptivity of our algorithm, which precludes the use of statistical results relying on i.i.d. sampling. We emphasize that the dependence on the effective dimension in Theorem 1 does not appear in [34], although the proof technique is essentially the same. A different adaptive sampling algorithm is proposed in [38] (see also [39]), where guarantees for the expected error are provided.

5 Numerical results

5.1 Exact algorithms

We evaluate the performance of the deterministic adaptive landmark sampling (DAS) and the randomized variant (RAS) on the Boston housing, Stock, Abalone and Bank 8FM datasets, which are described in Table 1. Those public datasets (available at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html) have been used for benchmarking k-DPPs in [21]. The implementation of the algorithms is done in Matlab R2018b. The quality of the landmarks is evaluated by calculating the relative operator norm of the Nyström approximation error, with a small regularization of the inverted block added for numerical stability.
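For reference, the evaluation criterion of Figure 2 can be computed as in the following sketch, where the small constant added to the inverted block plays the role of the numerical-stability regularization mentioned above (its value here is an arbitrary choice).

```python
import numpy as np

def nystrom_error(K, C, eps=1e-12):
    """Relative operator-norm error of the Nystrom approximation based on landmarks C."""
    C = np.asarray(C)
    K_nC = K[:, C]
    K_CC = K[np.ix_(C, C)] + eps * np.eye(len(C))  # small ridge for numerical stability
    K_hat = K_nC @ np.linalg.solve(K_CC, K_nC.T)   # K[:, C] (K_CC + eps I)^{-1} K[C, :]
    return np.linalg.norm(K - K_hat, 2) / np.linalg.norm(K, 2)
```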

(a) Stock (b) Housing (c) Abalone (d) Bank8FM
Figure 2: Relative operator norm of the Nyström approximation error as a function of the number of landmarks. The error is plotted on a logarithmic scale, averaged over 10 trials.
(a) Stock
(b) Housing
(c) Abalone
(d) Bank8FM
Figure 3: Timings for the computations of Figure 2 as a function of the number of landmarks. The timings are plotted on a logarithmic scale, averaged over 10 trials.

Throughout the experiments, we use a Gaussian kernel with a fixed bandwidth (cf. the end of Section 1) after standardizing the data. First, Algorithm 2 (RAS) is executed with given regularization and oversampling parameters, and returns a certain number of landmarks. Afterwards, the following algorithms are used to sample the same number of landmarks: uniform sampling (Unif.), k-DPP [20] (we used the Matlab code available at https://www.alexkulesza.com/), exact ridge leverage score sampling (RLS) [6, 3], Algorithm 1 (DAS) and the best low-rank approximation obtained by the truncated SVD. Those methods are executed for multiple values of the regularization parameter, and the best performing value is selected. In particular, the k-DPP is defined with an L-ensemble built from the kernel matrix. This procedure is repeated 10 times and the averaged results are visualized in Figure 2.

Dataset     n        d    c
Housing     506      13   100   5
Stock       950      10   100   5
Abalone     4177     8    150   5
Bank 8FM    8192     8    150   10
Adult       48842    14   200   10
MiniBooNE   130065   50   200   5,10
cod-RNA     331152   8    200   4
Covertype   581012   54   200   10
Table 1: Datasets and parameters used for the simulations.

DAS performs well on the Boston housing and Stock datasets, which show a fast decay in the spectrum of the kernel matrix (see Supplementary Material). This confirms the expectations from Proposition 4. The accuracy of the method in terms of the max norm is also illustrated in the Supplementary Material. If the decay of the eigenvalues is not fast enough, the randomized variant RAS has a superior performance, which is similar to the performance of a k-DPP, although RAS is a faster algorithm for obtaining diverse landmarks, as illustrated in Figure 3. The main cost of RAS is the calculation of the projector kernel matrix. The exact sampling of a k-DPP has a similar cost. The computer used for these simulations has processors and GB of RAM.

5.2 Approximate RAS

The size of the Nyström factorization might be a memory limitation in the case of large datasets. Therefore, it is advantageous to choose the number of landmarks as small as possible. We propose an approximate RAS method. First, the Gaussian kernel matrix is approximated by using random Fourier features [1]. In practice, we calculate a feature matrix whose Gram matrix approximates the kernel matrix. Then, we obtain an approximate projector kernel matrix, where the associated linear system is solved by using a Cholesky decomposition. This approximate projector matrix is used in place of the exact one in Algorithm 2. Next, we use the matrix inversion lemma in order to update the matrix inverse needed to calculate the sampling probability in Algorithm 2. The quality of the obtained landmarks is evaluated by calculating the error over subsets of points sampled uniformly at random. The Frobenius norm is chosen to reduce the runtime. A comparison is done with the recursive ridge leverage score sampling method (RRLS) [7], which is a highly scalable method (we used the Matlab code available at https://www.chrismusco.com/). The experiments are repeated 3 times and the sampling with the lowest average error over the subsets is shown for each method. An overview of the datasets and parameters used in the experiments (https://archive.ics.uci.edu/ml/datasets.php) is given in Table 1. In Figure 4, the sampling of landmarks with RAS for two different kernel parameters is compared to RRLS with the same number of landmarks on the MiniBooNE dataset. The boxplots show a reduction of the error, as well as a lower variance, to the advantage of RAS. The increase in performance becomes more apparent when fewer landmarks are sampled. Additional experiments are given in Figure 5 for the Adult, cod-RNA and Covertype datasets. The implementation of RAS was done in Julia 1.0 while the simulations were performed on a computer with GHz processors and GB RAM. The runtimes for RRLS and RAS are only given for completeness.
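The random Fourier feature approximation described above can be sketched as follows; the Gaussian-kernel feature map is the standard one of [1], while the approximate projector uses the push-through identity together with a Cholesky solve of the small system, and the $n\lambda$ scaling is again an assumption.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rff_features(X, D, sigma, seed=0):
    """Random Fourier features Z such that Z @ Z.T approximates the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, D)) / sigma        # spectral samples of the Gaussian kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def approx_projector(Z, lam):
    """Approximate projector Z (Z^T Z + n*lam*I)^{-1} Z^T for K ~ Z Z^T (push-through identity)."""
    n, D = Z.shape
    chol = cho_factor(Z.T @ Z + n * lam * np.eye(D))   # small D x D Cholesky factorization
    return Z @ cho_solve(chol, Z.T)
```

In practice one would avoid forming the full n-by-n approximate projector and only evaluate the entries required by Algorithm 2.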

(a) , .
(b) , .
Figure 4: Boxplot of the Nyström approximation error in Frobenius norm for MiniBooNE. The runtimes are: (a) s, s and (b) s, s.
(a) Adult, .
(b) cod-RNA, .
(c) Covertype, .
Figure 5: Boxplot of the Nyström approximation error in Frobenius norm. The runtimes are: (a) s, s, (b) s, s and (c) s, s.

6 Conclusion

Motivated by the connection between leverage scores, DPPs and kernelized Christoffel functions, we propose two sampling algorithms, DAS and RAS, together with a theoretical analysis. RAS allows us to sample diverse and important landmarks with a good overall performance. An additional large scale approximation is proposed so that RAS can be applied to large datasets.

Acknowledgements

EU: The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: Optimization frameworks for deep kernel machines C14/18/068. Flemish Government: FWO: projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant.

References

  • [1] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1177–1184, 2007.
  • [2] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688, 2001.
  • [3] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17:117:1–117:65, 2016.
  • [4] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, December 2005.
  • [5] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In COLT Conference on Learning Theory, pages 185–209, 2013.
  • [6] A. El Alaoui and M.W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 775–783, 2015.
  • [7] C. Musco and C. Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30, pages 3833–3845. 2017.
  • [8] A. Rudi, L. Carratino, and L. Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems 30, pages 3888–3898. 2017.
  • [9] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems 31, pages 5677–5687. 2018.
  • [10] M.-A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, 106(2):369–374, 2009.
  • [11] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 997–1006, New York, NY, USA, 2014. ACM.
  • [12] S. McCurdy. Ridge regression and provable deterministic ridge leverage score sampling. In Advances in Neural Information Processing Systems 31, pages 2468–2477. 2018.
  • [13] A. Farahat, A. Ghodsi, and M. Kamel. A novel greedy algorithm for Nyström approximation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 269–277, 2011.
  • [14] I. Guyon, N. Matic, and V. Vapnik. Advances in knowledge discovery and data mining. chapter Discovering Informative Patterns and Data Cleaning, pages 181–203. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
  • [15] B. Gauthier and J.A.K. Suykens. Optimal quadrature-sparsification for integral operator approximation. SIAM Journal on Scientific Computing, 40(5):A3636–A3674, 2018.
  • [16] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
  • [17] M. Girolami. Orthogonal series density estimation and the kernel eigenvalue problem. Neural Comput., 14(3):669–688, March 2002.
  • [18] J. B. Hough, M. Krishnapur, Y Peres, and B. Virág. Determinantal processes and independence. Probab. Surveys, 3:206–229, 2006.
  • [19] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
  • [20] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, 2011.
  • [21] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2061–2070. JMLR.org, 2016.
  • [22] C. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes. In AISTATS, 2016.
  • [23] N. Tremblay, S. Barthelmé, and P.-O. Amblard. Determinantal point processes for coresets. arXiv preprint arXiv:1803.08700.
  • [24] A. Askari, F. Yang, and L. El Ghaoui. Kernel-based outlier detection using the inverse christoffel function. CoRR, abs/1806.06775, 2018.
  • [25] E. Pauwels, F. Bach, and J.P. Vert. Relating Leverage Scores and Density using Regularized Christoffel Functions. In Advances in Neural Information Processing Systems 32, pages 1663–1672, 2018.
  • [26] J.B. Lasserre and E. Pauwels. The empirical Christoffel function with applications in machine learning. arXiv preprint arXiv:1701.02886.
  • [27] A. Rudi, E. De Vito, A. Verri, and F. Odone. Regularized kernel algorithms for support estimation. Frontiers in Applied Mathematics and Statistics, 3:23, 2017.
  • [28] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, 2006.
  • [29] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
  • [30] R. DeVore, G. Petrova, and P. Wojtaszczyk. Greedy algorithms for reduced bases in Banach spaces. Constructive Approximation, 37(3):455–466, Jun 2013.
  • [31] T. Gao, S. Kovalsky, and I. Daubechies. Gaussian process landmarking on manifolds. SIAM Journal on Mathematics of Data Science, 1(1):208–236, 2019.
  • [32] J. Altschuler, F. Bach, A. Rudi, and J. Weed. Massively scalable Sinkhorn distances via the Nyström method. arXiv preprint arXiv:1812.05189, 2018.
  • [33] M. Belkin. Approximation beats concentration? an approximation view on inference with smooth radial kernels. In Conference On Learning Theory, COLT, pages 1348–1361, 2018.
  • [34] M. B. Cohen, C. Musco, and J. Pachocki. Online row sampling. In Proceedings of the 19th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX), 2015.
  • [35] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W Function. In Advances in computational mathematics, pages 329–359, 1996.
  • [36] S. Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017.
  • [37] J. Tropp. Freedman’s inequality for matrix martingales. Electron. Commun. Probab., 16:262–270, 2011.
  • [38] S. Paul, M. Magdon-Ismail, and P. Drineas. Column selection via adaptive sampling. In Advances in Neural Information Processing Systems 28, pages 406–414, 2015.
  • [39] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-carlo Algorithms for Finding Low-rank Approximations. J. ACM, 51(6), November 2004.
  • [40] P. Binev, A. Cohen, W. Dahmen, R. DeVore, G. Petrova, and P. Wojtaszczyk. Convergence rates for greedy algorithms in reduced basis methods. SIAM Journal on Mathematical Analysis, 43(3):1457–1472, 2011.
  • [41] A. Pinkus. Matrices and n-widths. Linear Algebra and its Applications, 27:245–278, 1979.

Appendix A Christoffel function and orthogonal polynomials

For completeness, we simply recall here the definition of the Christoffel function [25]. Let and be an integrable real function on . Then, the Christoffel function reads:

where the minimization is over the set of $d$-variate polynomials of degree at most a prescribed order. The definition (1) involves a minimization over an RKHS and includes an additional regularization term.

Appendix B Proofs of the results related to Christoffel functions

B.1 A technical result

Lemma 5.

Let $\mathcal{H}$ be a Hilbert space, and let a vector and a collection of vectors in $\mathcal{H}$ be given. The solution of:

(10)

is unique and given by where .

Proof.

The objective is strongly convex and the solution is unique. By using Lagrange multipliers, we find that . ∎

B.2 Proofs

Proof of Proposition 1.

Firstly, since , the objective (2) and the constraints depend only on the discrete set . Therefore, we can decompose where and . Since by construction for all , the objective satisfies:

Hence, we can rather minimize on . This means that we can find such that . The objective (2) reads

The problem becomes then:

where and is a column of the matrix , while is the -th element of the canonical basis of . Recall that, since we assumed , then is invertible. By doing the change of variables , we obtain:

with and . Let and be the matrix whose columns are the vectors . Notice that the elements of are linearly independent since .

Then, we use Lemma 5, and find . By using with the Gram matrix , we find:

(11)

By substituting , we find that for all . Hence, we find:

which yields the desired result.

The optimal value of (2) is attained at with:

Proof of Proposition 2.

By using Lemma 5 or the results of the paper [25], we find:

where the weights are defined as for all and otherwise. Then, we notice that where is a sampling matrix obtained by sampling the columns of the identity matrix belonging to . The latter identity yields:

where we used Lemma 2.

Proof of Proposition 3.

The objective can be simply obtained by the following determinant:

Proof of Corollary 1.

Let . Considering (11) in the proof of Proposition 1, we see that and therefore . Thanks to this property, we have then the inequality for all . ∎

Proof of Lemma 1.

Firstly, we have:

Then, we can obtain a similar expression for the projector kernels:

where we substituted the definition . Next, we use , and the property that if with and . We obtain:

and the result follows from:

Proof of Lemma 4.

For simplicity let . By assumption

Then, we have:

Now, by using Lemma 3, we find:

Appendix C Proof of convergence of the deterministic algorithm

Let us explain how Algorithm 1 can be rephrased by considering first Proposition 3. First, we recall the spectral factorization of the projector kernel matrix. For simplicity, we introduce a short notation for the associated Hilbert space, as well as for the set of columns of the spectral factor, which is considered as a compact subset of this Hilbert space. The vector subspace generated by the columns indexed by the subset of landmarks is denoted accordingly. With these notations and thanks to Proposition 3, the square root of the objective maximized by Algorithm 1 can be written in terms of these columns. Hence, by using the definition of the Christoffel function, the algorithm can be viewed as a greedy reduced subspace method, namely: