Randomized Clustered Nyström for Large-Scale Kernel Machines
The Nyström method has been popular for generating the low-rank approximation of kernel matrices that arise in many machine learning problems. The approximation quality of the Nyström method depends crucially on the number of selected landmark points and the selection procedure. In this paper, we present a novel algorithm to compute the optimal Nyström low-approximation when the number of landmark points exceed the target rank. Moreover, we introduce a randomized algorithm for generating landmark points that is scalable to large-scale data sets. The proposed method performs K-means clustering on low-dimensional random projections of a data set and, thus, leads to significant savings for high-dimensional data sets. Our theoretical results characterize the tradeoffs between the accuracy and efficiency of our proposed method. Extensive experiments demonstrate the competitive performance as well as the efficiency of our proposed method.
Keywords: Kernel methods, Nyström method, Low-rank approximation, Random projections, Large-scale learning
Kernel machines have been widely used in various machine learning problems such as classification, clustering, and regression. In kernel-based learning, the input data points are mapped to a high-dimensional feature space and the pairwise inner products in the lifted space are computed and stored in a positive semidefinite kernel matrix . The lifted representation may lead to better performance of the learning problem, but a drawback is the need to store and manipulate a large kernel matrix of size , where is the size of data set. Thus a kernel machine has quadratic space complexity and quadratic or cubic computational complexity (depending on the specific type of machine).
One promising strategy for reducing these costs consists of a low-rank approximation of the kernel matrix , where for a target rank . Such low-rank approximations can be used to reduce the memory and computation cost by trading-off accuracy for scalability. For this reason, much research has focused on efficient algorithms for computing low-rank approximations, e.g., (Fine and Scheinberg, 2001; Bach and Jordan, 2002, 2005; Halko et al., 2011). The Nyström method is probably one of the most well-studied and successful methods that has been used to scale up several kernel methods (Kumar et al., 2009; Sun et al., 2015). The Nyström method works by selecting a small set of bases referred to as “landmark points” and computing the kernel similarities between the input data points and landmark points. Therefore, the performance of the Nyström method depends crucially on the number of selected landmark points as well as the procedure according to which these landmark points are selected.
The original Nyström method, first introduced to the kernel machine setting by Williams and Seeger (2001), proposed to select landmark points uniformly at random from the set of input data points. More recently, several other probabilistic strategies have been proposed to provide informative landmark points in the Nyström method, including sampling with weights proportional to column norms (Drineas et al., 2006), diagonal entries (Drineas and Mahoney, 2005), and leverage scores (Gittens and Mahoney, 2013). Zhang and Kwok (2010) proposed a non-probabilistic technique for generating landmark points using centroids resulting from K-means clustering on the input data points. The proposed “Clustered Nyström method” shows the Nyström approximation error is related to the encoding power of landmark points in summarizing data and it provides improved accuracy over other sampling methods such as uniform and column-norm sampling (Kumar et al., 2012). However, the main drawback of this method is the high memory and computational complexity associated with performing K-means clustering on high-dimensional large-scale data sets.
The aim of this paper is to improve the accuracy and efficiency of the Nyström method in two directions. We present a novel algorithm to compute the optimal rank- approximation in the Nyström method when the number of landmark points exceed the rank parameter . In fact, our proposed method can be used within all landmark selection procedures to compute the best rank- approximation achievable by a chosen set of landmark points. Moreover, we present an efficient method for landmark selection which provides a tunable tradeoff between the accuracy of low-rank approximations and memory/computation requirements. Our proposed “Randomized Clustered Nyström method” generates a set of landmark points based on low-dimensional random projections of the input data points (Achlioptas, 2003).
In more detail, our main contributions are threefold.
It is common to select more landmark points than the target rank to obtain high quality Nyström low-rank approximations. In Section 4, we present a novel algorithm with theoretical analysis for computing the optimal rank- approximation when the number of landmark points exceed the target rank . Thus, our proposed method, called “Nyström via QR Decomposition,” can be used with any landmark selection algorithm to find the best rank- approximation for a given set of landmark points. We also provide intuitive and real-world examples to show the superior performance and efficiency of our method in Section 4.
Second, we present a random-projection-type landmark selection algorithm which easily scales to large-scale high-dimensional data sets. Our proposed “Randomized Clustered Nyström method” presented in Section 5 performs the K-means clustering algorithm on the random projections of input data points and it requires only two passes over the original data set. Thus our method leads to significant memory and computation savings in comparison with the Clustered Nyström method. Moreover, our theoretical results (Theorem 2) show that the proposed method produces low-rank approximations with little loss in accuracy compared to Clustered Nyström with high probability.
Third, we present extensive numerical experiments comparing our Randomized Clustered Nyström method with a few other sampling methods on two tasks: (1) low-rank approximation of kernel matrices and (2) kernel ridge regression. In Section 6, we consider six data sets from the LIBSVM archive (Chang and Lin, 2011) with dimensionality up to .
2 Notation and Preliminaries
We denote column vectors with lower-case bold letters and matrices with upper-case bold letters. is the identity matrix of size ; is the matrix of zeros. For a vector , let denote the Euclidean norm, and represents a diagonal matrix with the elements of on the main diagonal. The Frobenius norm for a matrix is defined as , where represents the -th entry of , is the transpose of , and is the trace operator.
Let be a symmetric positive semidefinite (SPSD) matrix with . The singular value decomposition (SVD) or eigenvalue decomposition of can be written as , where contains the orthonormal eigenvectors, i.e., , and is a diagonal matrix which contains the eigenvalues of in descending order, i.e., . The matrices and can be decomposed for a target rank ():
where contains the leading eigenvalues and the columns of span the top -dimensional eigenspace, and and contain the remaining eigenvalues and eigenvectors. It is well-known that is the “best rank- approximation” to in the sense that minimizes over all matrices of rank at most (Eckart and Young, 1936) and we have . If , then is not unique, so we write to mean any matrix satisfying Equation 1. The pseudo-inverse of can be obtained from the SVD or eigenvalue decomposition as . When is full rank, we have .
Another matrix factorization technique that we use in this paper is the QR decomposition. An matrix , with , can be decomposed as a product of two matrices , where has orthonormal columns, i.e., , and is an upper triangular matrix. Sometimes this is called the thin QR decomposition, to distinguish it from a full QR decomposition which finds and zero-pads accordingly.
3 Background and Related Work
Kernel methods have been successfully applied to a variety of machine learning problems such as classification and regression. Well-known examples include support vector machines (SVM) (Cortes and Vapnik, 1995), kernel principal component analysis (KPCA) (Schölkopf et al., 1998), kernel ridge regression (Shawe-Taylor and Cristianini, 2004), kernel clustering (Girolami, 2002), and kernel dictionary learning (Van Nguyen et al., 2013). The main idea behind kernel-based learning is to map the input data points into a feature space, where all pairwise inner products of the mapped data points can be computed via a nonlinear kernel function that satisfies Mercer’s condition (Aronszajn, 1950; Schölkopf and Smola, 2001). Thus, kernel methods allow one to use linear algorithms in the higher (or infinite) dimensional feature space which correspond to nonlinear algorithms in the original space. For this reason, kernel machines have received much attention as an effective tool to tackle problems with complex and nonlinear structures.
Let be a data matrix that contains data points in as its columns. The inner products in feature space are calculated using a “kernel function” defined on the original space:
where is the kernel-induced feature map. All pairwise inner products of the mapped data points are stored in the so-called “kernel matrix” , where the -th entry is . Two well-known examples of kernel functions that lead to symmetric positive semidefinite (SPSD) kernel matrices are Gaussian and polynomial kernel functions. The former takes the form and the polynomial kernel is of the form , where and are the parameters (Van Nguyen et al., 2012; Pourkamali-Anaraki and Hughes, 2013). Moreover, combinations of multiple kernels can be constructed to tackle problems with complex and heterogeneous data sources (Bach et al., 2004; Gönen and Alpaydın, 2011; Liu et al., 2016).
Despite the simplicity of kernel machines in nonlinear representation of data, one prominent problem is the calculation, storage, and manipulation of the kernel matrix for large-scale data sets. The cost to form using standard kernel functions is and it takes memory to store the full kernel matrix. Thus, both memory and computation cost scale as the square of the number of data points. Moreover, subsequent processing of the kernel matrix within the learning process is computationally quite expensive. For example, algorithms such as KPCA and kernel dictionary learning compute the eigenvalue decomposition of the kernel matrix, where the standard techniques take time and multiple passes over will be required. In other kernel-based learning methods such as kernel ridge regression, the inverse of the kernel matrix , where is a regularization parameter, must be computed which requires time (Cortes et al., 2010; Alaoui and Mahoney, 2015). Thus, large-scale data sets have provided a considerable challenge to the design of efficient kernel-based learning algorithms (Slavakis et al., 2014; Hsieh et al., 2014).
A well-studied approach to reduce the memory and computation burden associated with kernel machines is to use a low-rank approximation of kernel matrices. This approach utilizes the decaying spectra of kernel matrices and the best rank- approximation is computed, cf. Equation 1. Since is SPSD, the eigenvalue decomposition can be used to express a low-rank approximation in the form of:
The benefits of this low-rank approximation are twofold. First, it takes to store the matrix which is only linear in the data set size . The reduction of memory requirements from quadratic to linear results in significant memory savings. Second, the low-rank approximation leads to substantial computational savings within the learning process. For example, the following matrix inversion arising in algorithms such as kernel ridge regression can be calculated using the Sherman-Morrison-Woodbury formula:
Here, we only need to invert a much smaller matrix of size just . Thus, the computation cost is to compute and the matrix inversion in Equation 2.
Another example of computation savings is the “linearization” of kernel methods using the low-rank approximation, where linear algorithms are applied to the rows of . In this case, the matrix serves as an empirical kernel map and the rows of are known as virtual samples. This strategy has been shown to speed up various kernel-based learning methods such as SVM, kernel dictionary learning, and kernel clustering (Zhang et al., 2012; Golts and Elad, 2016; Pourkamali-Anaraki and Becker, 2016).
While the low-rank approximation of kernel matrices is a promising approach to reduce the memory and computational complexity, the main bottleneck is the computation of the full kernel matrix and the best rank- approximation . Standard algorithms for computing the eigenvalue decomposition of take time. Partial eigenvalue decomposition, e.g., Krylov subspace method, can be performed to find the leading eigenvalues/eigenvectors. However, these techniques require at least passes over the entire kernel matrix which is prohibitive for large dense matrices (Halko et al., 2011).
To address this problem, much recent work has focused on efficient randomized methods to compute low-rank approximations of large matrices (Mahoney, 2011). The Nyström method is one of the few randomized approximation techniques that does not need to first compute the entire kernel matrix. The standard Nyström method was first introduced (in the context of matrix kernel approximation) in (Williams and Seeger, 2001) and is based on sampling a small subset of input data columns, after which the kernel similarities between the small subset and input data points are computed to construct a rank- approximation. Section 3.1 discusses in detail the Nyström method and its extension which finds the approximate eigenvalue decomposition of the kernel matrix.
Since the sampling technique is a key aspect of the Nyström method, much research has focused on selecting the most informative subset of input data to improve the approximation accuracy and thus the performance of kernel-based learning methods (Kumar et al., 2012). An overview of different sampling techniques, including the Clustered Nyström method, is presented in Section 3.2.
3.1 The Nyström Method
The Nyström method for generating a low-rank approximation of the SPSD kernel matrix works by selecting a small set of bases referred to as “landmark points”. For example, the simplest and most common technique to select the landmark points is based on uniform sampling without replacement from the set of all input data points (Williams and Seeger, 2001). In this section, we explain the Nyström method for a given set of landmark points regardless of the sampling mechanism.
Let be the set of landmark points in . The Nyström method first constructs two matrices and , where and . Next, it uses both and to construct a low-rank approximation of the kernel matrix :
For the rank-restricted case, the Nyström method generates a rank- approximation of the kernel matrix, , by computing the best rank- approximation of the inner matrix (Kumar et al., 2012, 2009; Sun et al., 2015; Li et al., 2015; Wang and Zhang, 2013):
where represents the pseudo-inverse of . Thus, the eigenvalue decomposition of the matrix should be computed to find the top eigenvalues and corresponding eigenvectors. Let and contain the top eigenvalues and the corresponding orthonormal eigenvectors of , respectively. Then, the rank- approximation in Equation 3 can be expressed as:
The time complexity of the Nyström method to form is , where it takes to construct matrices and . Also, it takes time to perform the partial eigenvalue decomposition of and represents the cost of matrix multiplication . Thus, for , the computation cost to form the low-rank approximation of the kernel matrix, , is only linear in the data set size .
In practice, there exist two approaches to obtain the approximate eigenvalue decomposition of the kernel matrix in the Nyström method. The first approach is based on the exact eigenvalue decomposition of to get the following estimates of the leading eigenvalues and eigenvectors of (Kumar et al., 2012):
These estimates of eigenvalues/eigenvectors are naive since it is easy to show that the estimated eigenvectors are not guaranteed to be orthonormal, i.e., . Moreover, the factor in Equation 5 is used to roughly compensate for the small size of the matrix compared to the kernel matrix. Thus, the accuracy of this approach depends heavily on the data set and the selected landmark points.
The second approach provides more accurate estimates of eigenvalues and eigenvectors of by using the low-rank approximation in Equation 4, and in fact this approach provides the exact eigenvalue decomposition of . The first step is to find the exact eigenvalue decomposition of the matrix:
where . Then, the estimates of leading eigenvalues and eigenvectors of are obtained as follows (Zhang and Kwok, 2010):
For this case, the resultant eigenvectors are orthonormal:
where this comes from the fact that contains orthonormal eigenvectors and . The overall procedure to estimate the leading eigenvalues/eigenvectors based on the Nyström method is summarized in Algorithm 1. The time complexity of the approximate eigenvalue decomposition is , in addition to the cost of computing mentioned earlier.
3.2 Sampling Techniques for the Nyström Method
The importance of landmark points in the Nyström method has driven much recent work into various probabilistic and deterministic sampling techniques to improve the accuracy of Nyström-based approximations (Kumar et al., 2012; Sun et al., 2015). In this section, we review a few popular sampling methods in the literature.
The simplest and most common sampling method proposed originally by Williams and Seeger (2001) was uniform sampling without replacement. In this case, each data point in the data set is sampled with the same probability, i.e., , for . The advantage of this technique is the low computational complexity associated with sampling landmark points. However, it has been shown that uniform sampling does not take into account the nonuniform structure of many data sets. Therefore, sampling mechanisms based on nonuniform distributions have been proposed to address this problem. Two such examples include: (1) “Column-norm sampling” (Drineas et al., 2006), where columns of the kernel matrix are sampled with weights proportional to the norm of columns of (not of the data matrix ), i.e., , and (2) “diagonal sampling” (Drineas and Mahoney, 2005), where the weights are proportional to the corresponding diagonal elements, i.e., . The former requires time and space to find the nonuniform distribution, while the latter requires time and space. The column-norm sampling method requires computing the entire kernel matrix , which negates one of the principal benefits of the Nyström method. The diagonal sampling method reduces to the uniform sampling for shift-invariant kernels, such as the Gaussian kernel function, since for all . Recently, Gittens and Mahoney (2013) have studied both empirical and theoretical aspects of uniform and nonuniform sampling on the accuracy of Nyström-based low-rank approximations.
The “Clustered Nyström method” proposed by Zhang and Kwok (2010); Zhang et al. (2008) is a popular non-probabilistic approach that uses out-of-sample extensions to select informative landmark points. The key observation of their work is that the Nyström low-rank approximation error depends on the quantization error of encoding the entire data set with the landmark points. For this reason, the Clustered Nyström method sets the landmark points to be the centroids found from K-means clustering. In machine learning and pattern recognition, K-means clustering (Bishop, 2006) is a well-established technique to partition a data set into clusters by trying to minimize the total sum of the squared Euclidean distances of each point to the closest cluster center.
To present the main result of Clustered Nyström method, we first explain K-means clustering briefly. Given , an -partition of this data set is a collection of disjoint and nonempty sets (each representing a cluster) such that their union covers the entire data set. Each cluster can be defined by a cluster center, which is the sample mean of data points in that cluster. Thus, the goal of K-means clustering is to minimize the following:
where represents the centroid of the cluster to which the data point is assigned, and hence depends on . The optimal clustering is the solution of following NP-hard optimization problem (Bishop, 2006):
In practice, Lloyd’s algorithm (Lloyd, 1982), also known as the K-means clustering algorithm, is used to solve the optimization problem in Equation 6. The K-means clustering algorithm is an iterative procedure which consists of two steps: (1) data points are assigned to the nearest cluster centers, and (2) the cluster centers are updated based on the most recent assignment of the data points. The objective function decreases at every step, and so the procedure is guaranteed to terminate since there are only finitely many partitions. Typically, only a few iterations are needed to converge to a locally optimal solution. The quality of clustering can be improved by using well-chosen initialization, such as K-means++ initialization (Arthur and Vassilvitskii, 2007).
Now, we present the result of the Clustered Nyström method which relates the Nyström approximation error (in terms of the Frobenius norm) to the quantization error induced by encoding the data set with landmark points (Zhang and Kwok, 2010).
Proposition 1 (Clustered Nyström Method)
Assume that the kernel function satisfies the following property:
where is a constant depending on . Consider the data set and the landmark set which partitions the data set into clusters . Let denote the closest landmark point to each data point :
Consider the kernel matrix , , and the Nyström approximation , where and . The approximation error in terms of the Frobenius norm is upper bounded:
where and are two constants and is the total quantization error of encoding each data point with the closest landmark point :
In (Zhang and Kwok, 2010), it is shown that for a number of widely used kernel functions, e.g., linear, polynomial, and Gaussian, the property in Equation 7 is satisfied. Based on Proposition 1, the Clustered Nyström method tries to minimize the total quantization error in Equation 9—and thus the Nyström approximation error—by performing the K-means algorithm on the data points . The resulting cluster centers are then chosen as the landmark points to construct matrices and and generate the low-rank approximation . One benefit of the approach is that the full kernel matrix is never formed.
4 Improved Nyström Approximation via QR Decomposition
In Section 3.1, we explained the Nyström method to compute rank- approximations of SPSD kernel matrices based on a set of landmark points. For a data set of size and a small set of landmark points (), two matrices and are constructed to form the low-rank approximation of : , where .
Although the final goal is to find an approximation that has rank no greater than , it is often preferred to select landmark points and then restrict the resultant approximation to have rank at most , e.g., (Sun et al., 2015; Li et al., 2015; Wang and Zhang, 2013). The main intuition is that selecting landmark points and then restricting the approximation to a lower rank- space has a regularization effect which can lead to more accurate approximations (Gittens and Mahoney, 2013). For example, Proposition 1 states that the approximation error is a function of the total quantization error induced by encoding data points with the set of landmark points. Obviously, the more landmark points are selected, the total quantization error becomes smaller and thus the quality of rank- approximation can be improved. Therefore, it is important to use an efficient and accurate method to restrict the matrix to have rank at most .
In the standard Nyström method presented in Algorithm 1, the rank of matrix is restricted by computing the best rank- approximation of the inner matrix : . Since the inner matrix in the representation of has rank no greater than , it follows that has rank at most . The main benefit of this technique is the low computational cost of performing an exact eigenvalue decomposition or SVD on a relatively small matrix of size . However, the standard Nyström method totally ignores the structure of the matrix and is solely based on “filtering” . In fact, since the rank- approximation does not utilize the full knowledge of matrix , the selection of more landmark points does not guarantee an improved low-rank approximation in the standard Nyström method.
To solve this problem, we present an efficient method to compute the best rank- approximation of the matrix , for given matrices and . In contrast with the standard Nyström method, our proposed approach takes advantage of both matrices and . To begin, let us consider the best rank- approximation of the matrix :
where (a) follows from the QR decomposition of ; , where and . To get (b), the eigenvalue decomposition of the matrix is computed, , where the diagonal matrix contains eigenvalues in descending order on the main diagonal and the columns of are the corresponding eigenvectors. Moreover, we note that the columns of are orthonormal because both and have orthonormal columns:
Thus, the decomposition contains the eigenvalues and orthonormal eigenvectors of the Nyström approximation . Based on the Eckart-Young theorem, the best rank- approximation of is then computed using the leading eigenvalues and corresponding eigenvectors , as given in Equation 10. Thus, the estimates of the top eigenvalues and eigenvectors of the kernel matrix from the Nyström approximation are obtained as follows:
These estimates can also be used to approximate the kernel matrix as , where .
The overall procedure to estimate the leading eigenvalues/eigenvectors of the kernel matrix based on a set of landmark points , , is presented in Algorithm 2. The time complexity of this method is , where represents the cost to form matrices and . The complexity of the QR decomposition is and it takes time to compute the eigenvalue decomposition of . Finally, the cost to compute the matrix multiplication is .
We can compare the computational complexity of our proposed Nyström method via QR decomposition (Algorithm 2) with that of the standard Nyström method (Algorithm 1). Since our focus in this paper is on large-scale data sets with large, we only consider terms involving which lead to dominant computation costs. Based on the discussion in Section 3.1, it takes time to compute the eigenvalue decomposition using the standard Nyström method. For our proposed method, the cost of eigenvalue decomposition is . Thus, for data of even moderate dimension with , the dominant term in both and is . This means that the increase in computation cost of our method ( vs. ) becomes less significant when the number of landmark points is close to the target rank .
In the rest of this section, we compare the performance and efficiency of our proposed method presented in Algorithm 2 with Algorithm 1 on three examples. As we will see, our proposed method yields more accurate decompositions than the standard Nyström method for small values of , such as .
4.1 Toy Example
It is always true that for any kernel matrix , (this is also true in the spectral norm), due to the best-approximation properties of our estimator. We can show, using examples, that this inequality can be quite large.
In the first example, we consider a small kernel matrix of size :
Such a matrix could arise, for example, using the polynomial kernel with parameters and and the data matrix:
Here, the goal is to compute the rank approximation of . Suppose that columns of the kernel matrix are sampled uniformly, e.g., the first and second columns. Then, we have:
In the standard Nyström method, the best rank- approximation of the inner matrix is first computed111One might ask if it is better to first find and then find the best rank- approximation of . This generally does not help, and one can construct similar toy examples where this approach does arbitrarily poorly as well.. Then, based on Equation 3, the rank- approximation of the kernel matrix in the standard Nyström method is given by:
The normalized kernel approximation error in terms of the Frobenius norm is large: . On the other hand, using the same matrices and , our proposed method first computes the QR decomposition of :
Then, the product of three matrices is computed to find its eigenvalue decomposition :
Finally, the rank- approximation of the kernel matrix in our proposed method is obtained by using Equation 10:
where . In fact, one can show that our approximation is the same as the best rank- approximation formed using full knowledge of , i.e., . Furthermore, clearly we can tweak this toy example to make the error and for any . This example demonstrates that “Nyström via QR Decomposition” produces much more accurate rank- approximation of the kernel matrix with same matrices and used in the standard Nyström method.
4.2 Synthetic Data Set
As shown in Figure (a)a, we consider a synthetic data set consisting of data points in that are nonlinearly separable. Therefore, a nonlinear kernel function is employed to find an embedding of these points so that linear learning algorithms can be applied to the mapped data points. To do this, we use the polynomial kernel function with the degree and the constant , i.e., . Next, a low-rank approximation of the kernel matrix in the form of , , is computed by using the Nyström method. The rows of represent the virtual samples or mapped data points (Zhang et al., 2012; Golts and Elad, 2016; Pourkamali-Anaraki and Becker, 2016). Given a suitable kernel function and accurate low-rank approximation technique, the rows of in are linearly separable. In this example, we set the target rank so that we can easily visualize the resultant mappings.
We measure the approximation accuracy by using the normalized kernel approximation error defined as , where the matrix is obtained by using the standard Nyström method and our proposed method “Nyström via QR Decomposition”. In Figure (b)b, the mean and standard deviation of the normalized kernel approximation error over trials for varying number of landmark points are reported. In each trial, the landmark points are chosen uniformly at random without replacement from the input data. Both our method and the standard Nyström method share same matrices and for a fair comparison. As we expect, the accuracy of our Nyström via QR decomposition is exactly the same as the standard Nyström method for . As the number of landmark points increases, the accuracy of standard Nyström method improves and it slowly gets closer to the accuracy of exact eigenvalue decomposition or SVD. However, our proposed method reaches the accuracy of SVD even for . In fact, we observe that the approximation error of our method by using landmark points is better than the accuracy of standard Nyström method with . For this example, our proposed rank- approximation technique in Algorithm 2 is more accurate and memory efficient than the standard Nyström method with at least one order of magnitude savings in memory.
Finally, we visualize the mapped data points using both methods for fixed . In Figure (c)c and Figure (d)d, the rows of and are plotted, respectively. The rows of in the “Nyström via QR Decomposition” method are linearly separable which is desirable for kernel-based learning. But, the rows of are not linearly separable due to the poor performance of the standard Nyström method.
4.3 Real Data Set: satimage
In the last example, we use the satimage data set (Chang and Lin, 2011) with and . We duplicate each data point four times to increase to in order to have a more meaningful comparison of computation times. The kernel matrix is formed using the Gaussian kernel function where the parameter is chosen as the averaged squared distance between all the data points and the sample mean (Zhang and Kwok, 2010). The landmark points are chosen by performing K-means on the original data, following the Clustered Nyström method.
In Figure (a)a and Figure (c)c, the mean and standard deviation of normalized kernel approximation error are reported over trials for varying number of landmark points and two values of the target rank and , respectively. As expected, when the number of landmark points is set to be the same as the target rank, the standard Nyström method and our proposed method have exactly the same approximation error. Interestingly, it is seen that when the number of landmark points increases, the approximation error does not necessarily decrease in the standard Nyström method as shown in Figure (a)a. This is a major drawback of the standard Nyström method because the increase in memory and computation costs imposed by larger may lead to worse performance. In contrast, our proposed “Nyström via QR Decomposition” outperforms the standard Nyström method for both values of the target rank and , and we know theoretically that performance can only improve as increases. Moreover, we see that the accuracy of our method reaches the accuracy of the best rank- approximation obtained by using the SVD for as few as landmark points.
The runtime of both methods are also compared in Figure (b)b and Figure (d)d for two cases of and , respectively. The reported values are averaged over trials and they represent the computation cost associated with Algorithm 1 and Algorithm 2. As we explained earlier in this section, the computational complexity of our method will be slightly increased compared to the standard Nyström method and this is consistent with the timing results in Figure (b)b and Figure (d)d. Moreover, we see that the runtime of our method is increased by almost a factor of even for large values of . To have a fair comparison, we draw a dashed green line that determines the values of for which both methods have the same running time. In Figure (b)b, the runtime for in our method is the same as in the standard Nyström, while our method is much more accurate. Similarly, in Figure (d)d, the runtime for in our method is almost the same as in the standard Nyström method. However, our method results in more accurate low-rank approximation of the kernel matrix. This dataset further supports that our “Nyström via QR Decomposition” results in more accurate low-rank approximations than the standard Nyström method with significant memory and computation savings.
5 Randomized Clustered Nyström Method
The selection of informative landmark points is an essential component to obtain accurate low-rank approximations of SPSD matrices in the Nyström method. The Clustered Nyström method (Zhang and Kwok, 2010) has been shown to be a powerful technique for generating highly accurate low-rank approximations compared to uniform sampling and other sampling methods (Kumar et al., 2012; Sun et al., 2015; Iosifidis and Gabbouj, 2016). However, the main drawbacks of this method are high memory and computational complexities associated with performing K-means clustering on large-scale data sets. In this section, we introduce an efficient randomized method for generating a set of representative landmark points based on low-dimensional random projections of the original data. Specifically, our proposed method provides a “tunable tradeoff” between the accuracy of Nyström low-rank approximations and the efficiency in terms of memory and computation savings.
To introduce our proposed method, we begin by explaining the process of generating landmark points in the Clustered Nyström method. As mentioned in Section 3.2, the central idea behind Clustered Nyström is that the approximation error depends on the total quantization error of encoding each data point in the data set with the closest landmark point. Thus, landmark points are chosen to be centroids resulting from the K-means clustering algorithm which partitions the data set into clusters. Given an initial set of centroids , the K-means clustering algorithm iteratively updates assignments and cluster centroids as follows (Bishop, 2006):
Update assignments: for
Update cluster centroids: for
where denotes the number of data points in the cluster and is the sample mean of the -th cluster.
For large-scale data sets with large and/or , the memory requirements and computation cost of performing the K-means clustering algorithm become expensive (Ailon et al., 2009; Shindler et al., 2011; Feldman et al., 2013). First, the K-means algorithm requires several passes on the entire data set and thus the data set should often be stored in a centralized location which takes memory. Second, the time complexity of K-means clustering is per iteration to partition the set of data points into clusters (Traganitis et al., 2015). Hence, the high dimensionality of massive data sets provides considerable challenge to the design of memory and computation efficient alternatives for the Clustered Nyström method.
One promising strategy to address these obstacles is to use random projections of the data for constructing a small set of new features (Achlioptas, 2003; Pourkamali-Anaraki and Hughes, 2014; Zhang et al., 2014; Pourkamali-Anaraki et al., 2015). In this case, for some parameter , the data matrix is multiplied on the left by a random zero-mean matrix in order to compute a low-dimensional representation:
The columns of are known as sketches or compressive measurements (Davenport et al., 2010) and the random map preserves the geometry of data under certain conditions (Tropp, 2011). The task of clustering is then performed on these low-dimensional data points by minimizing , which partitions the data points in the reduced space into clusters. After finding the partition in the reduced space, the same partition is used on the original data points and the cluster centroids in the original space are calculated using Equation 12 at computational cost .
In this paper, we introduce a random-projection-type Clustered Nyström method, called “Randomized Clustered Nyström,” for generating landmark points. In the first step of our method, a random sign matrix whose entries are independent realizations of Bernoulli random variables is constructed:
Next, the product is computed to find the low-dimensional sketches . The standard implementation of matrix multiplication costs . The matrix multiplication can also be performed in parallel which leads to noticeable accelerations in practice (Halko et al., 2011). Moreover, it is possible to use the mailman algorithm (Liberty and Zucker, 2009) which takes advantage of the binary-nature of to further speed up the matrix multiplication. In our experiments, we use Intel MKL BLAS version 11.2.3 which is bundled with MATLAB, which we found to be sufficiently optimized and does not form a bottleneck in the computational cost.
In the second step, the K-means clustering algorithm is performed on the projected low-dimensional data to partition the data set:
where is the resulting -partition. We cannot guarantee that K-means returns the globally optimal partition as the problem is NP-hard (Dasgupta, 2008) but seeding using K-means++ (Arthur and Vassilvitskii, 2007) guarantees a partition with expected objective within a factor of the optimal one, and other variants of K-means, under mild assumptions (Ostrovsky et al., 2012), can either efficiently guarantee a solution within a constant factor of optimal, or guarantee solutions arbitrarily close to optimal, so-called polynomial-time approximation schemes (PTAS). Lastly, the landmark points are generated by computing the sample mean of data points:
The proposed “Randomized Clustered Nyström” method is summarized in Algorithm 3. In our method, the “compression factor” is defined as the ratio of parameter to the ambient dimension , i.e., . Regarding the memory complexity, our method requires only two passes on the data set , the first to compute the low-dimensional sketches (step 3), and the second for the sample mean (step 5). In fact, our Randomized Clustered Nyström only stores the low-dimensional sketches which takes space, whereas the Clustered Nyström method has memory complexity of , meaning our method reduces the memory complexity by a factor of . In terms of time complexity, the computation cost of K-means on the dimension-reduced data in our method is per iteration compared to the cost in the Clustered Nyström method, so the speedup is up to (the exact amount depends on the number of iterations, since we must amortize the cost of the one-time matrix multiply ).
Thus, our proposed method for generating landmark points provides a tunable parameter to reduce the memory and computation cost of the Clustered Nyström method. Next, we study and characterize the “tradeoffs” between accuracy of low-rank approximations and the memory/computation savings in our proposed method. In particular, the following theorem presents an error bound on the Nyström low-rank approximation for a set of landmark points generated via our Randomized Clustered Nyström method (Algorithm 3).
Theorem 2 (Randomized Clustered Nyström Method)
Assume that the kernel function satisfies Equation 7. Consider the data set and the kernel matrix with entries . The optimal partitioning of