Scale Up Nonlinear Component Analysis with
Doubly Stochastic Gradients
Nonlinear component analysis such as kernel Principle Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they can not scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be the same order of magnitude as the number of data points, making such approach not directly applicable to the regime with millions of data points.
We propose a simple, computationally efficient, and memory friendly algorithm based on the “doubly stochastic gradients” to scale up a range of kernel nonlinear component analysis, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate to the global optimum, even for the top eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.
Scaling up nonlinear component analysis has been challenging due to prohibitive computation and memory requirements. Recently, methods such as Randomized Component Analysis  are able to scale to larger datasets by leveraging random feature approximation. Such methods approximate the kernel function by using explicit random feature mappings, then perform subsequent steps in the primal form, resulting in linear computational complexity. Nonetheless, theoretical analysis [23, 16] shows that in order to get high quality results, the number of random features should grow linearly with the number of data points. Experimentally, one often sees that the statistical performance of the algorithm improves as one increases the number of random features.
Another approach to scale up the kernel component analysis is to use stochastic gradient descent and online updates [19, 20]. These stochastic methods have also been extended to the kernel case [13, 6, 11]. They require much less computation than their batch counterpart, converge in rate, and are naturally applicable to streaming data setting. Despite that, they share a severe drawback: all data points used in the updates need to be saved, rendering them impractical for large datasets.
In this paper, we propose to use the “doubly stochastic gradients” for nonlinear component analysis. This technique is a general framework for scaling up kernel methods  for convex problems and has been successfully applied to many popular kernel machines such as kernel SVM, kernel ridge regressions, and Gaussian process. It uses two types of stochastic approximation simultaneously: random data points instead of the whole dataset (as in stochastic update rules), and random features instead of the true kernel functions (as in randomized component analysis). These two approximations lead to the following benefits:
Computation efficiency The key computation is the generation of a mini-batch of random features and the evaluation of them on a mini-batch of data points, which is very efficient.
Memory efficiency Instead of storing training data points, we just keep a small program for regenerating the random features, and sample previously used random features according to pre-specified random seeds. This leads to huge savings: the memory requirement up to step is , independent of the dimension of the data.
Adaptibility Unlike other approaches that can only work with a fixed number of random features beforehand, doubly stochastic approach is able to increase the model complexity by using more features when new data points arrive, and thus enjoys the advantage of nonparametric methods.
Although on first look our method appears similar to the approach in , the two methods are fundamentally different. In , they address convex problems, whereas our problem is highly non-convex. The convergence result in  crucially relies on the properties of convex functions, which do not translate to our problem. Instead, our analysis centers around the stochastic update of power iterations, which uses a different set of proof techniques.
In this paper, we make the following contributions.
General framework We show that the general framework of doubly stochastic updates can be applied in various kernel component analysis tasks, including KPCA, KSVD, KCCA, etc..
Strong theoretical guarantee We prove that the finite time convergence rate of doubly stochastic approach is . This is a significant result since 1) the global convergence result is w.r.t. a non-convex problem; 2) the guarantee is for update rules without explicit orthogonalization. Previous works require explicit orthogonalization, which is impractical for kernel methods on large datasets.
Strong empirical performance Our algorithm can scale to datasets with millions of data points. Moreover, the algorithm can often find much better solutions thanks to the ability to use many more random features. We demonstrate such benefits on both synthetic and real world datasets.
Since kernel PCA is a typical task, we focus on it in the paper and provide a description of other tasks in Section 6. Although we only state the guarantee for kernel PCA, the analysis naturally carries over to the other tasks.
2 Related work
Many efforts have been devoted to scale up kernel methods. The random feature approach [22, 23] approximates the kernel function with explicit random feature mappings and solves the problem in primal form, thus circumventing the quadratic computational complexity. It has been applied to various kernel methods [15, 8, 16], among which most related to our work is Randomized Component Analysis . One drawback of Randomized Component Analysis is that their theoretical guarantees are only for kernel matrix approximation: it does not say anything about how close the solution obtained from randomized PCA is to the true solution. In contrast, we provide a finite time convergence rate of how our solution approaches the true solution. In addition, even though a moderate size of random features can work well for tens of thousands of data points, datasets with tens of millions of data points require many more random features. Our online approach allows the number of random features, hence the flexibility of the function class, to grow with the number of data points. This makes our method suitable for data streaming setting, which is not possible for previous approaches.
Online algorithms for PCA have a long history. Oja proposed two stochastic update rules for approximating the first eigenvector and provided convergence proof in [19, 20], respectively. These rules have been extended to the generalized Hebbian update rules [25, 27, 4] that compute the top eigenvectors (the subspace case). Similar ones have also been derived from the perspective of optimization and stochastic gradient descent [27, 2]. They are further generalized to the kernel case [13, 6, 11]. However, online kernel PCA needs to store all the training data, which is impractical for large datasets. Our doubly stochastic method avoids this problem by using random features and keeping only a small program for regenerating previously used random features according to pre-specified seeds. As a result, it can scale up to tens of millions of data points.
For finite time convergence rate,  proved the rate for the top eigenvector in linear PCA using Oja’s rule. For the same task,  proposed a noise reduced PCA with linear convergence rate, where the rate is in terms of epochs, i.e., number of passes over the whole dataset. The noisy power method presented in  provided linear convergence for a subspace, although it only converges linearly to a constant error level. In addition, the updates require explicit orthogonalization, which is impractical for kernel methods. In comparison, our method converges in for a subspace, without the need for orthogonalization.
3.1 Kernels and Covariance Operators
A kernel is a function that is positive-definite (PD), i.e., for all , , and , we have
A reproducing kernel Hilbert space (RKHS) on is a Hilbert space of functions from to . is an RKHS if and only if there exists a such that If such a exist, it is unique and it is a PD kernel. A function if and only if .
Given a distribution , a kernel function with RKHS , the covariance operator is a linear self-adjoint operator defined as
and furthermore , .
Let be a list of functions in the RKHS, and we define matrix-like notation
and is a matrix, whose -th element is . The outer-product of a function defines a linear operator such that
Let be a list of functions, then the weighted sum of a set of linear operators, , can be denoted using matrix-like notation as
where is a diagonal matrix with on the -th entry of the diagonal.
3.2 Kernel PCA
Kernel PCA aims to identify the top eigenfunctions for the covariance operator , where is also called the top subspace for .
A function is an eigenfunction of covariance operator with the corresponding eigenvalue if
Given a set of eigenfunctions and associated eigenvalues , where . We can denote the eigenvalue of as
where is the top eigenfunctions of , and is a diagonal matrix with the corresponding eigenvalues, is the collection of the rest of the eigenfunctions, and is a diagonal matrix with the rest of the eigenvalues.
In the finite data case, the empirical covariance operator is or denoted as . According to the representer theorem, the solutions of the top eigenfunctions of can be expressed as linear combinations of the training points with the set of coefficients ,
Using and the kernel trick, we have
where is the Gram matrix.
The infinite dimensional problem is thus reduced to a finite dimensional eigenvalue problem. However, this dual approach is clearly impractical on large scale datasets due quadratic memory and computational costs.
|Dot Product |
is random diagonal matrix and the columns of are uniformly selected from . and are positive parameters.
is a modified Bessel function. stands for element-wise product. ,
3.3 Random feature approximation
The usage of random features to approximate a kernel function is motivated by the following theorem.
Theorem 1 (Bochner).
A continuous, real-valued, symmetric and shift-invariant function on is a PD kernel if and only if there is a finite non-negative measure on , such that where is a uniform distribution on , and .
The theorem says that any shift-invariant kernel function , e.g., Gaussian RBF kernel, can be considered as an expectation of two feature functions and , where the expectation is taked over a distribution on the random frequency and phase .
We can therefore approximate the kernel function as an empirical average of samples from the distribution. In other words,
where are i.i.d. samples drawn from from and , respectively.
The specific random feature functions and distributions have been worked out for many popular kernels. For Gaussian RBF kernel, , this yields a Gaussian distribution with density proportional to ; for the Laplace kernel, this yields a Cauchy distribution; and for the Martern kernel, this yields the convolutions of the unit ball . Similar representation where the explicit form of and are known can also be derived for rotation invariant kernel, , using Fourier transformation on sphere . For polynomial kernels, , a random tensor sketching approach can also be used . See Table 1 for explicit representations of different kernels.
In this section, we describe an efficient algorithm based on the “doubly stochastic gradients” to scale up kernel PCA. KPCA is essentially an eigenvalue problem in a functional space. Traditional approaches convert it to the dual form, leading to another eigenvalue problem whose size equals the number of training points, which is not scalable. Other approaches solve it in the primal form with stochastic functional gradient descent. However, these algorithms need to store all the training points seen so far. They quickly run into memory issues when working with hundreds of thousands of data points.
We propose to tackle the problem with “doubly stochastic gradients”, in which we make two unbiased stochastic approximations. One stochasticity comes from sampling data points as in stochastic gradient descent. Another source of stochasticity is from random features to approximate the kernel.
One technical difficulty in designing doubly stochastic KPCA is an explicit orthogonalization step required in the update rules, which ensures the top eigenfunctions are orthogonal. This is infeasible for kernel methods on a large dataset since it requires solving an increasingly larger KPCA problem in every iteration. To solve this problem, we formulate the orthogonality constraints into Lagrange multipliers which leads to an Oja-style update rule. The new update enjoys small per iteration complexity and converges to the ground-truth subspace.
We present the algorithm by first deriving the stochastic functional gradient update without random feature approximations, then introducing the doubly stochastic updates.
4.1 Stochastic functional gradient update
Kernel PCA can be formulated as the following non-convex optimization problem
where and is the -th function.
The Lagrangian that incorporates the constraint is
where is the Lagrangian multiplier. The gradient of the Lagrangian w.r.t is
Furthermore, from the optimality conditions
we can find
Plugging this into the gradient, it suggests the following update rule
4.2 Doubly stochastic update
The update rule (9) has a fundamental computational drawback. At each time step , a new basis is added to , and it is therefore a linear combination of the feature mappings of all the data points up to . This requires the algorithm to store all the data points it has seen so far, which is impractical for large scale datasets.
To address this issue, we use the random feature approximation . Denote the function we get at iteration , the update rule becomes
where is the evaluation of at the current data point: .
Given , we can explicitly represent as a linear combination of all the random feature functions :
where are the coefficients, and .
The update rule on the functions corresponds to the following update for the coefficients
The specific updates in terms of the coefficients are summarized in Algorithms 1 and 2. Note that in theory new random features are drawn in each iteration, but in practice one can revisit these random features.
In this section, we provide finite time convergence guarantees for our algorithm. As discussed in the previous section, explicit orthogonalization is not scalable for the kernel case, therefore we need to provide guarantees for the updates without orthogonalization. This challenge is even more prominent when using random features, since it introduces additional variance.
Furthermore, our guarantees are w.r.t. the top -dimension subspace. Although the convergence without normalization for a top eigenvector has been established before [19, 20], the subspace case is complicated by the fact that there are angles between -dimension subspaces, and we need to bound the largest angle. To the best of our knowledge, our result is the first finite time convergence result for a subspace without explicit orthogonalization.
Note that even though it appears our algorithm is similar to  on the surface, the underlying analysis is fundamentally different. In , the result only applies to convex problems where every local optimum is a global optimum while the problems we consider are highly non-convex. As a result, many techniques that  builds upon are not applicable.
In order to analyze the convergence of our doubly stochastic kernel PCA algorithm, we will need to define a few intermediate subspaces. For simplicity of notation, we will assume the mini-batch size for the data points is one.
Let be the subspace estimated using stochastic gradient and explicit orthogonalization:
Let be the subspace estimated using stochastic update rule without orthogonalization:
where and can be equivalently written using the evaluation of the function on the current data point, leading to the equivalent rule :
Let be the subspace estimated using stochastic update rule without orthogonalization, but the evaluation of the function on the current data point is replaced by the evaluation :
Let be the subspace estimated using doubly stochastic update rule without orthogonalization, i.e., the update rule:
The relation of these subspaces are summarized in Table 2. Using these notations, we describe a sketch of our analysis in the rest of the section, while the complete proofs are provided in the appendix.
We first consider the subspace estimated using the stochastic update rule, since it is simpler and its proof can provide the bases for analyzing the subspace estimated by the doubly stochastic update rule.
|Subspace||Evaluation||Orth.||Data Mini-batch||RF Mini-batch|
5.2 Conditions and Assumptions
We will focus on the case when a good initialization is given:
Throughout the paper we suppose and regard and as constants. Note that this is true for all the kernels and corresponding random features considered. We further regard the eigengap as a constant, which is also true for typical applications and datasets.
5.3 Update without random features
Our guarantee is on the cosine of the principal angle between the computed subspace and the ground truth eigen subspace (also called potential function): .
Consider the two different update rules, one with explicit orthogonalization and another without
where is the empirical covariance of a mini-batch. Our final guarantee for is the following.
Assume (13) and suppose the mini-batch sizes satisfy that for any , There exist step sizes such that
The convergence rate is in the same order as that of computing only the top eigenvector in linear PCA . The bound requires the mini-batch size is large enough so that the spectral norm of is approximated up to the order of the eigengap. This is because the increase of the potential is in the order of the eigengap. Similar terms appear in the analysis of the noisy power method  which, however, requires orthogonalization and is not suitable for the kernel case. We do not specify the mini-batch size, but by assuming suitable data distributions, it is possible to obtain explicit bounds; see for example [30, 5].
Proof sketch We first prove the guarantee for the orthogonalized subspace which is more convenient to analyze, and then show that the updates for and are first order equivalent so enjoys the same guarantee. To do so, we will require lemma 3 and 4 below
Let denote , then a key step in proving the lemma is to show the following recurrence
We will need the mini-batch size large enough so that is smaller than the eigen-gap.
Another key element in the proof of the theorem is the first order equivalence of the two update rules. To show this, we introduce to denote the subspace by applying the update rule of on . We show that the potentials of and are close:
5.4 Doubly stochastic update
The computed in the doubly stochastic update is no longer in the RKHS so the principal angle is not well defined. Instead, we will compare the evaluation of functions from and the true principal subspace respectively on a point . Formally, we show that for any function with unit norm , there exists a function in such that for any , is small with high probability.
To do so, we need to introduce a companion update rule: resulting in function in the RKHS, but the update makes use of function values from which outside the RKHS. Let be the coefficients of projected onto , , and . Then the error can be decomposed as
By definition, , so the first error term can be bounded by the guarantee on , which can be obtained by similar arguments in Theorem 2. For the second term, note that is defined in such a way that the difference between and is a martingale, which can be bounded by careful analysis.
Assume (13) and suppose the mini-batch sizes satisfy that for any , and are of order . There exist step sizes , such that the following holds. If for all , then for any and any function in the span of with unit norm , we have that with probability at least , there exists in the span of satisfying
The point-wise error scales as with the step . Besides the condition that is up to the order of the eigengap, we additionally need that the random features approximate the kernel function up to constant accuracy on all the data points up to time , which eventually leads to mini-batch sizes. Finally, we need to be roughly isotropic, i.e., is roughly orthonormal. Intuitively, this should be true for the following reasons: is orthonormal; the update for is close to that for , which in turn is close to that are orthonormal.
Proof sketch In order to bound term I in (15), we show that
This is proved by following similar arguments to get the recurrence (14), except with an additional error term, which is caused by the fact that the update rule for is using the evaluation rather than . Bounding this additional term thus relies on bounding the difference between , which is also what we need for bounding term II in (15). For this, we show:
For any and unit vector , with probability over ,
The key to prove this lemma is that our construction of makes sure that the difference between and consists of their difference in each time step. Furthermore, the difference forms a martingale and thus can be bounded by Azuma’s inequality. See the supplementary for the details.
The proposed algorithm is a general technique for solving eigenvalue problems in the functional space. Numerous machine learning algorithms boil down to this fundamental operation. Therefore, our method can be easily extended to solve many related tasks, including latent variable estimation, kernel CCA, spectral clustering, etc..
We briefly illustrate how to extend to different machine learning algorithms in the following subsections.
6.1 Locating individual eigenfunctions
The proposed algorithm finds the subspace spanned by the top eigenfunctions, but it does not isolate the individual eigenfunctions. When we need to locate these individual eigenfunctions, we can use a modified version, called Generalized Hebbian Algorithm (GHA) . Its update rule is
where is an operator that sets the lower triangular parts to zero.
To understand the effect of the upper triangular operator, we can see that forces the update rule for the first function of to be exactly the same as that of one-dimensional subspace; all the contributions from the other functions are zeroed out.
Therefore, the first function will converge to the eigenfunction corresponding to the top eigenvalue.
For all the other functions, implements a Gram-Schmidt-like orthogonalization that subtracts the contributions from other eigenfunctions.
6.2 Latent variable models and kernel SVD
Latent variable models are probabilistic models that assume unobserved or latent structures in the data. It appears in specific forms such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) and Latent Dirichlet Allocations (LDA), etc..
The EM algorithm  is considered the standard approach to solve such models. Recently, spectral methods have been proposed to estimate latent variable models with provable guarantees [1, 29]. Compared with the EM algorithm , spectral methods are faster to compute and do not suffer from local optima.
The key algorithm behind spectral methods is the SVD. However, kernel SVD scales quadratically with the number of data points. Our algorithm can be straightforwardly extended to solve kernel SVD. The extension hinges on the following relation
where is the SVD of .
It is therefore reduced to the eigenvalue problem. Plugging it into the update rule and treating the two blocks separately, we thus get two simultaneous update rules
The algorithm for updating the coefficients is summarized in Algorithm 3.
6.3 Kernel CCA and generalized eigenvalue problem
Kernel CCA and ICA  can also be solved under the proposed framework because they can be viewed as generalized eigenvalue problem.
Given two variables and , CCA finds two projections such that the correlations between the two projected variables are maximized. Given the covariance matrices , , and , CCA is equivalent to the following problem
where and are the top canonical correlation functions for variables and , respectively, and is the corresponding canonical correlation.
This is a generalized eigenvalue problem. It can reformulated as the following non-convex optimization problem
Following the derivation for the standard eigenvalue problem, we get the foliowing update rules
Denote and the canonical correlation functions for and , respectively. We can rewrite the above update rule as two simultaneous rules
We present the detailed updates for coefficients in Algorithm 4.
6.4 Kernel sliced inverse regression
We demonstrate the effectiveness and scalability of our algorithm on both synthetic and real world datasets.
7.1 Synthetic dataset with analytical solution
We first verify the convergence rate of DSGD-KPCA on a synthetic dataset with analytical solution of eigenfunctions . If the data follow a Gaussian distribution, and we use a Gaussian kernel, then the eigenfunctions are given by the Hermite polynomials.
We generated 1 million data points, and ran DSGD-KPCA with a total of 262,144 random features. In each iteration, we use a data mini-batch of size 512, and a random feature mini-batch of size 128. After all random features are generated, we revisit and adjust the coefficients of existing random features. The kernel bandwidth is set as the true bandwidth of the data.
The step size is scheduled as
where and are two parameters. We use a small such that in early stages the step size is large enough to arrive at a good initial solution.
Convergence Figure 1 shows the convergence rate of the proposed algorithm seeking top subspace. The potential function is calculated as the squared sine function of the subspace angle between the current solution and the ground-truth. We can see the algorithm indeed converges at the rate .
Eigenfunction Recovery Figure 2 demonstrate the recovered top eigenfunctions compared with the ground-truth. We can see the found solution coincides with one eigenfunction, and only disagree slightly on two others.
7.2 Nonparametric Latent Variable Model
In , the authors proposed a multiview nonparametric latent variable model that is solved by kernel SVD followed by tensor power iterations. The algorithm can separate latent variables without imposing specific parametric assumptions of the conditional probabilities. However, the scalability of the algorithm was limited by kernel SVD.
Here, we demonstrate that with DSGD-KSVD, we can learn latent variable models with one million data, achieving higher quality of learned components compared with two other approaches.
DSGD-KSVD uses a total of 8192 random features, and in each iteration, it uses a feature mini-batch of size 256 and a data mini-batch of size 512.
We compare with 1) random Fourier features with fixed 2048 functions, and 2) random Nystrom features with fixed 2048 functions. The Nystrom features are calculated by first uniformly sampling 2048 data points, and then evaluate kernel function values on these data points .
The dataset consists of two latent components, one is a Gaussian distribution and the other follows a Gamma distribution with shape parameter . One million data point are generated from this mixture distribution.
Figures 3 shows the learned conditional distributions for each component. We can see DSGD-KSVD achieves almost perfect recovery, while Fourier and Nystrom random feature methods either confuse high density areas or incorrectly estimate the spread of conditional distributions.
|# of feat||Random features||Nystrom features|
7.3 Kcca Mnist8m
We then demonstrate the scalability and effectiveness of our algorithm on a large-scale real world dataset. MNIST8M consists of 8.1 million hand-written digits and their transformations. Each digit is of size . We divide each image into the left and right parts, and learn their correlations using KCCA. Thus the input feature dimension is 392.
The evaluation criteria is the total correlations on the top canonical correlation directions calculated on a separate test set of size 10000. Out of the 8.1 million training data, we randomly choose 10000 as an evaluation set.
We compare with 1) random Fourier and 2) random Nystrom features on both total correlation and running time. We vary the number of random features used for both methods. Our algorithm uses a total of 20480 features. In each iteration, we use feature mini-batches of size 2048 and data mini-batches of size 1024, and we run 3000 iterations. The kernel bandwidth is set using the “median” trick and is the same for all methods. Due to randomness, all algorithms are run 5 times, and the mean is reported.
The results are presented in Table 3. We can see Nystrom features generally achieve better results than Fourier features. Note that for Fourier features, we are using the version with and pairs, so the real number of parameters is twice the number in the table, as a result the computational time is almost twice of that for Nystrom features.
Our algorithm achieves the best test-set correlations in comparable run time with random Fourier features. This is especially significant for random Fourier features, since the run time would increase by almost four times if double the number of features were used. We can also see that for large datasets, it is important to use more random features for better performance. Actually, the number of random features required should grow linearly with the number of data points. Therefore, our algorithm provides a good balance between the number of random features used and the number of data points processed.
7.4 Kernel PCA visualization on molecular space dataset
MolecularSpace dataset contains 2.3 million molecular motifs . We are interested in visualizing the dataset with KPCA. The data are represented by sorted Coulomb matrices of size . Each molecule also has an attribute called power conversion efficiency (PCE). We use a Gaussian kernel with bandwidth chosen by the “median trick”. We ran kernel PCA with a total of 16384 random features, with a feature mini-batch size of 512, and data mini-batch size of 1024. We ran 4000 iterations with step size .
Figure 4 presents visualization by projecting the data onto the top two principle components. Compared with linear PCA, KPCA shrinks the distances between the clusters and brings out the important structures in the dataset. We can also see although the PCE values do not necessarily correspond to the clusters, higher PCE values tend to lie towards the center of the ring structure.
7.5 Kernel sliced inverse regression on KUKA dataset
We evaluate our algorithm under the setting of kernel sliced inverse regression , a way to perform sufficient dimension reduction (SDR) for high dimension regression. After performing SDR, we fit a linear regression model using the projected input data, and evaluate mean squared error (MSE). The dataset records rhythmic motions of a KUKA arm at various speeds, representing realistic settings for robots . We use a variant that contains 2 million data points generated by the SL simulator. The KUKA robot has 7 joints, and the high dimension regression problem is to predict the torques from positions, velocities and accelerations of the joints. The input has 21 dimensions while the output is 7 dimensions. Since there are seven independent joints, we set the reduced dimension to be seven. We randomly select 20% as test set and out of the remaining training set, we randomly choose 5000 as validation set to select step sizes. The total number of random features is 10240, with mini-feature batch and mini-data batch both equal to 1024. We run a total of 2000 iterations using step size