Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features
Shingo Yashima, Atsushi Nitanda, Taiji Suzuki
The University of Tokyo
Abstract
Although kernel methods are widely used in many learning problems, they scale poorly to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques for deriving efficient large-scale learning algorithms. In this study, we consider solving a binary classification problem using random features and stochastic gradient descent. Recent research has shown an exponential convergence rate of the expected classification error under the strong low-noise condition. We extend these analyses to a random features setting. Specifically, we analyze the error induced by the random features approximation in terms of the distance between the generated hypotheses, including population risk minimizers and empirical risk minimizers, when using general Lipschitz loss functions, and show that exponential convergence of the expected classification error is achieved even if the random features approximation is applied. Additionally, we demonstrate that the convergence rate does not depend on the number of features and that, because of the strong low-noise condition, there is a significant computational benefit in using random features in classification problems.
1 Introduction
Kernel methods are commonly used to solve a wide range of problems in machine learning, as they provide flexible non-parametric modeling techniques and come with well-established theories about their statistical properties [caponnetto2007optimal, steinwart2009optimal, mendelson2010regularization]. However, computing estimators in kernel methods can be prohibitively expensive in terms of memory requirements for large datasets.
There are two popular approaches to scaling up kernel methods. The first is sketching, which reduces data dimensionality by random projections. The random features method [rahimi2008random] is a representative example, which approximates a reproducing kernel Hilbert space (RKHS) by a finite-dimensional space in a data-independent manner. The second is stochastic gradient descent (SGD), which processes data points individually in each iteration to calculate gradients. Both of these methods are quite effective in reducing memory requirements and are widely used in practical tasks.
For the theoretical properties of random features, several studies have investigated the approximation quality of kernel functions [sriperumbudur2015optimal, sutherland2015error, pmlr-v89-szabo19a], but only a few have considered the generalization properties of learning with random features. For the regression problem, the generalization properties in the ERM and SGD settings have been studied extensively in [rudi2017generalization] and [carratino2018learning], respectively. In particular, they showed that $O(\sqrt{n} \log n)$ features, where $n$ is the sample size, are sufficient to achieve the usual $O(1/\sqrt{n})$ learning rate, indicating that there is a computational benefit to using random features.
However, it remains unclear whether random features are computationally efficient for other tasks. In [rahimi2009weighted], the generalization properties were studied with Lipschitz loss functions under an $\ell_\infty$-constraint on the hypothesis space, and it was shown that $O(n)$ features are required for $O(1/\sqrt{n})$ learning bounds. Also, in [li2018toward], learning with Lipschitz loss and standard regularization was considered instead of the $\ell_\infty$-constraint, and similar results were obtained. Both results suggest that computational gains come at the expense of learning accuracy if one considers general loss functions.
In this study, learning classification problems with random features and SGD is considered, and the generalization property is analyzed in terms of the classification error. Recently, it was shown that the convergence rate of the excess classification error can be made exponentially faster by assuming the strong low-noise condition [tsybakov2004optimal, koltchinskii2005exponential], namely that conditional label probabilities are uniformly bounded away from $1/2$ [pillaud2018exponential, nitanda2018stochastic]. We extend these analyses to a random features setting to show that the exponential convergence is achieved if a sufficient number of features are sampled. Unlike the convergence of the loss function, the resulting convergence rate of the classification error is independent of the number of features. In other words, an arbitrarily small classification error is achievable as long as there is a sufficient number of random features. Our result thus suggests that there is indeed a computational benefit to using random features in classification problems under the strong low-noise condition.
Our contributions. Our contributions are twofold. First, we analyze the error induced by the approximation of random features in terms of the distance between the generated hypotheses, including population risk minimizers and empirical risk minimizers, when using general Lipschitz loss functions. Our results can be framed as an extension of the analysis in [cortes2010impact, sutherland2015error], which analyzed the error in terms of the distance between empirical risk minimizers when using a hinge loss. Several studies consider the optimal sampling distributions of features in terms of the worst-case error [bach2017equivalence, avron2017random, li2018toward]; we do not explore this direction and instead treat the original random features algorithm, because these distributions are generally intractable or computationally expensive to sample from [bach2017equivalence].
Second, using the above result, we prove that the exponential convergence rate of the excess classification error under the strong low-noise condition is achieved if a sufficient number of features are sampled. Based on this, we show that there is a significant computational gain in using random features rather than a full kernel method for obtaining a relatively small classification error.
Paper organization. This paper is organized as follows. In Section 2, the algorithm combining random features and SGD treated in this study is briefly reviewed. In Section 3, an error analysis of the generated hypothesis using random features is presented, after which a more refined analysis is given for the case of a Gaussian kernel. Our primary result describing the exponential convergence rate of the classification error is given in Section 4. Finally, numerical experiments using synthetic datasets are presented in Section 5.
2 Problem Setting
In this section, we introduce the notation and assumptions for the binary classification problem and the kernel method treated in this study.
2.1 Binary Classification Problem
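Let $\mathcal{X}$ and $\mathcal{Y} = \{-1, 1\}$ be a feature space and the set of binary labels, respectively. We denote by $\rho$ a probability measure on $\mathcal{X} \times \mathcal{Y}$, by $\rho_{\mathcal{X}}$ the marginal distribution on $\mathcal{X}$, and by $\rho(\cdot \mid x)$ the conditional distribution on $\mathcal{Y}$, where $x \in \mathcal{X}$. In general, for a probability measure $\mu$, $L^2(\mu)$ denotes the space of square-integrable functions with respect to $\mu$, and $L^2(\mathcal{X})$ denotes the one with respect to the Lebesgue measure. Similarly, $L^\infty(\mu)$ denotes the space of functions for which the essential supremum with respect to $\mu$ is bounded, and $L^\infty(\mathcal{X})$ denotes the one with respect to the Lebesgue measure.
In the classification problem, our final objective is to choose a discriminant function $g: \mathcal{X} \to \mathbb{R}$ such that the sign of $g(X)$ is an accurate prediction of $Y$. Therefore, we intend to minimize the expected classification error defined below amongst all measurable functions:
$$\mathcal{R}(g) = \mathbb{E}_{(X, Y) \sim \rho}\left[ I(\mathrm{sgn}(g(X)), Y) \right], \tag{1}$$
where $\mathrm{sgn}(x) = 1$ if $x > 0$ and $-1$ otherwise, and $I$ represents the 0-1 loss:
$$I(y, y') = \begin{cases} 1 & (y \neq y'), \\ 0 & (y = y'). \end{cases} \tag{2}$$
By definition, $g(x) = \mathbb{E}[Y \mid x] = 2\rho(1 \mid x) - 1$ minimizes $\mathcal{R}$. However, directly minimizing (1) to obtain the Bayes classifier is intractable because of its non-convexity. Thus, we generally use a convex surrogate loss $l$ instead of the 0-1 loss and minimize the expected loss function of $l$:
$$\mathcal{L}(g) = \mathbb{E}_{(X, Y) \sim \rho}\left[ l(g(X), Y) \right]. \tag{3}$$
In general, the loss function has the form $l(\zeta, y) = \phi(\zeta y)$, where $\phi$ is a non-negative convex function. Typical examples are the logistic loss, where $\phi(v) = \log(1 + \exp(-v))$, and the hinge loss, where $\phi(v) = \max\{0, 1 - v\}$. Minimizing the expected loss function (3) ensures minimizing the expected classification error (1) if $l$ is classification-calibrated [bartlett2006convexity], which has been proven for several practically implemented losses including hinge loss and logistic loss.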
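As a concrete illustration of the losses discussed above, the following minimal Python sketch evaluates the 0-1 loss together with the logistic and hinge surrogates, written as functions of the margin (score times label). The function names are ours, and the convention that a zero score is classified as -1 is an assumption about the sign function.

```python
import math


def zero_one_loss(score: float, label: int) -> float:
    """0-1 loss of the sign prediction (assumed convention: +1 only if score > 0)."""
    predicted = 1 if score > 0 else -1
    return 0.0 if predicted == label else 1.0


def logistic_loss(score: float, label: int) -> float:
    """Logistic surrogate phi(v) = log(1 + exp(-v)) at the margin v = score * label."""
    v = score * label
    # Stable rewrite: log(1 + exp(-v)) = max(-v, 0) + log(1 + exp(-|v|)).
    return max(-v, 0.0) + math.log1p(math.exp(-abs(v)))


def hinge_loss(score: float, label: int) -> float:
    """Hinge surrogate phi(v) = max(0, 1 - v) at the margin v = score * label."""
    return max(0.0, 1.0 - score * label)
```

Both surrogates are convex in the score, whereas the 0-1 loss is piecewise constant, which is why the surrogates are the ones optimized in practice.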
2.2 Kernel Methods and Random Features
In this study, we consider a reproducing kernel Hilbert space (RKHS) associated with a positive definite kernel function as the hypothesis space. It is known [aronszajn1950theory] that a positive definite kernel uniquely defines its RKHS such that the reproducing property holds for all and , where denotes the inner product of . Let denote the norm of induced by the inner product. Under these settings, we attempt to solve the following minimization problem:
where is a regularization parameter.
However, because solving the original problem (4) is usually computationally inefficient for large-scale datasets, an approximation method is applied in practice. Random features [rahimi2008random] is a widely used method for scaling up kernel methods because of its simplicity and ease of implementation. Additionally, it approximates the kernel in a data-independent manner, making it easy to combine with SGD. In random features, a kernel function is assumed to have the following expansion in some space with a probability measure :
The main idea behind random features is to approximate the integral (5) by its Monte-Carlo estimate:
For example, if is a shift invariant kernel, by Bochner’s theorem [yoshida1995functional], the expansion (5) is achieved with , where is a normalization constant. Then, the approximation (6) is called random Fourier features [rahimi2008random], which is the most widely used variant of random features.
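To make the Monte-Carlo approximation concrete, here is a minimal Python sketch of random Fourier features for a Gaussian kernel. It assumes the parametrization exp(-||x - y||^2 / (2 sigma^2)) and uses the common cosine-with-random-phase variant of the feature map; the function names, the bandwidth parameter sigma, and this particular feature map are our illustrative choices and may differ from the exact form used in the paper.

```python
import math
import random


def sample_rff(dim, num_features, sigma, rng):
    """Sample frequencies w_i ~ N(0, sigma^-2 I) and phases b_i ~ Uniform[0, 2 pi]."""
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases


def feature_map(x, freqs, phases):
    """Feature vector with entries sqrt(2 / M) * cos(w_i . x + b_i)."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]


def approx_kernel(x, y, freqs, phases):
    """Monte-Carlo kernel estimate: inner product of the two feature vectors."""
    zx, zy = feature_map(x, freqs, phases), feature_map(y, freqs, phases)
    return sum(a * b for a, b in zip(zx, zy))


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Calling approx_kernel with a few thousand sampled features and comparing against gaussian_kernel on the same pair of points shows the estimate tightening as the number of features grows, without the features ever depending on the data.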
We denote the RKHSs associated with and by and , respectively. These spaces then admit the following explicit representation [bach2017equivalence, bach2017breaking]:
We note that the approximation space is not necessarily contained in the original space . For and , the following RKHS norm relations hold:
As a result, the problem (4) in the approximation space is reduced to the following generalized linear model:
where is a feature vector:
In this paper, we consider solving the problem (11) using the averaged SGD. The details are discussed in the following section.
2.3 Averaged Stochastic Gradient Descent
SGD is the most popular method for solving large-scale learning problems. In this section, we discuss a specific form of SGD based on [nitanda2018stochastic]. We note that although only the averaged version of SGD is considered here, similar results can be shown without averaging by following the analysis in [nitanda2018stochastic]. For the optimization problem (11), its gradient with respect to is given as follows:
where is a partial derivative with respect to the first variable of . Thus, the stochastic gradient with respect to is given by . We note that the update on the parameter corresponds to the update on the function space , because a gradient on is given by
The algorithm of random features and averaged SGD is described in Algorithm 1.
Following [nitanda2018stochastic], we set the learning rate and the averaging weight as follows:
where is an offset parameter for the time index. We note that an averaged iterate can be updated iteratively as follows:
Using this formula, we can compute the averaged output without storing all internal iterates.
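The running-average idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the step size 2 / (lam * (gamma + t)), the linearly increasing averaging weights, the zero initialization, and the factor lam / 2 on the regularizer are our assumptions in the spirit of [nitanda2018stochastic], not a transcription of Algorithm 1, and all names are hypothetical.

```python
import math
import random


def logistic_margin_derivative(v: float) -> float:
    """Derivative of phi(v) = log(1 + exp(-v)), computed without overflow."""
    if v >= 0:
        e = math.exp(-v)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(v))


def averaging_weight(t: int, gamma: float) -> float:
    """Weight given to the newest iterate in the running average (assumed schedule)."""
    return 2.0 * (gamma + t) / ((t + 1.0) * (2.0 * gamma + t))


def averaged_sgd(samples, feature_map, num_features, lam, gamma):
    """One pass of regularized SGD on a linear model over the given features.

    Returns (last iterate, averaged iterate, all iterates). The full list of
    iterates is kept only so the running average can be checked afterwards.
    """
    theta = [0.0] * num_features          # assumed zero initialization
    theta_bar = list(theta)               # running average starts at the first iterate
    iterates = [list(theta)]
    for t, (x, y) in enumerate(samples, start=1):
        eta = 2.0 / (lam * (gamma + t))   # assumed step size
        z = feature_map(x)
        score = sum(th * zi for th, zi in zip(theta, z))
        grad_scale = y * logistic_margin_derivative(y * score)
        # Stochastic gradient of loss + (lam / 2) * ||theta||^2.
        theta = [(1.0 - eta * lam) * th - eta * grad_scale * zi
                 for th, zi in zip(theta, z)]
        w = averaging_weight(t, gamma)
        theta_bar = [(1.0 - w) * tb + w * th for tb, th in zip(theta_bar, theta)]
        iterates.append(list(theta))
    return theta, theta_bar, iterates
```

With these weights, the running average after T steps equals the batch average in which the s-th iterate receives weight 2 (gamma + s - 1) / ((2 gamma + T)(T + 1)), so only the current iterate and the current average need to be kept in memory.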
2.4 Computational Complexity
If we assume the evaluation of a feature map to have a constant cost, one iteration in Algorithm 1 requires $O(M)$ operations, where $M$ is the number of features. As a result, one-pass SGD on $n$ samples requires $O(Mn)$ computational time. On the other hand, the full kernel method without approximation requires $O(n)$ computations per iteration; thus, the overall computation time is $O(n^2)$, which is much more expensive than random features.
For the memory requirements, random features needs to store $M$ coefficients, a number that does not depend on the sample size $n$. On the other hand, we have to store $n$ coefficients in the full kernel method, so it is also advantageous to use random features in large-scale learning problems.
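The comparison above can be made tangible with a simple operation-count model. The sketch below is an assumption-laden illustration rather than a measurement: it counts one unit per coefficient touched, assumes that iteration t of the full kernel method touches the t coefficients accumulated so far, and uses hypothetical function names.

```python
def random_features_cost(num_samples: int, num_features: int) -> int:
    """Coefficient updates in one pass: every iteration touches all features."""
    return num_samples * num_features


def full_kernel_cost(num_samples: int) -> int:
    """Coefficient updates in one pass of kernel SGD: 1 + 2 + ... + n."""
    return num_samples * (num_samples + 1) // 2


def random_features_memory(num_features: int) -> int:
    """Number of stored coefficients, independent of the sample size."""
    return num_features


def full_kernel_memory(num_samples: int) -> int:
    """Number of stored coefficients, one per processed sample."""
    return num_samples
```

Under this model the cumulative cost of the full kernel method overtakes that of random features once the number of samples exceeds roughly twice the number of features, and its memory footprint overtakes it as soon as the number of samples exceeds the number of features.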
3 Error Analysis of Random Features for General Loss Function
Our primary purpose here is to bound the distance between the hypothesis generated by solving the problems in each space and . Population risk minimizers in spaces are defined as below:
The uniqueness of minimizers is guaranteed by the regularization term.
First, the -norm of the difference between and is bounded when the loss function is Lipschitz continuous. Then, a more concrete analysis is provided for the case where is a Gaussian kernel.
3.1 Error analysis for population risk minimizers
Before beginning the error analysis, some assumptions about the loss function and kernel function are imposed.
is convex and -Lipschitz continuous, that is, there exists such that for any and ,
This assumption implies -Lipschitzness of with respect to the norm, because
for any . For several practically used losses, such as logistic loss or hinge loss, this assumption is satisfied with .
To control continuity and boundedness of the induced kernel, the following assumptions are required:
The function is continuous and there exists such that for any .
If is Gaussian and is its random Fourier features, it is satisfied with . This assumption implies and it leads to an important relationship
For the two given kernels and , is also a positive definite kernel, and its RKHS includes and . The last assumption imposes a specific norm relationship in its combined RKHS of and .
Let be the RKHS with the kernel function . Then there exist and a constant , depending on , that satisfy, for any ,
with probability at least
For a fixed kernel function, Assumption 3 is a commonly used condition in the analysis of kernel methods [steinwart2009optimal, mendelson2010regularization]. It is satisfied, for example, when the eigenfunctions of the kernel are uniformly bounded and the eigenvalues decay at the rate [mendelson2010regularization]. In Theorem 2, specific and that satisfy the condition for the case of a Gaussian kernel and its random Fourier features approximation are derived.
Here, we introduce our primary result, which bounds the distance between and . The complete statement, including the proof and all constants, is found in Appendix C.
Theorem 1 (Simplified).
The resulting error rate is . It can be easily shown that a consistent error rate of is seen for -norm without Assumption 3.
Comparison to previous results.
In [cortes2010impact, sutherland2015error], the distance between empirical risk minimizers of SVM (i.e., is hinge loss) was studied in terms of the error induced by Gram matrices. Considering and to be the Gram matrices of the kernels and , respectively, they showed that , where is the operator norm defined in Appendix A. Because the Gram matrix can be considered as the integral operator on the empirical measure, we can apply Lemma 1 and obtain , so the resulting rate is . This coincides with our result, because when is an empirical measure, Assumption 3 holds with .
From this perspective, our result is an extension of these previous results, because we treat the more general Lipschitz loss function and general measure including empirical measure.
In [rudi2017generalization, carratino2018learning], the case of squared loss was studied. In particular, in Lemma 8 of [rudi2017generalization], the distance between and is shown as (without decreasing ). While this is a better rate than ours, our theory covers a wider class of loss functions, and a similar phenomenon is observed in the case of empirical risk minimizers for the squared loss and hinge loss [cortes2010impact].
In [bach2017equivalence], approximations of functions in by functions in were considered, but this result cannot be applied here because is not the function closest to in . Finally, we note that our result cannot be obtained from the approximation analysis of Lipschitz loss functions [rahimi2009weighted, li2018toward], where the rate was shown to be under several assumptions, because the closeness of the loss values does not imply that of the hypothesis.
3.2 Further analysis for Gaussian kernels
The following theorem shows that if is a Gaussian kernel and is its random Fourier features approximation, then the norm condition in Assumption 3 is satisfied for any .
Assume is a bounded set and has a density with respect to Lebesgue measure, which is uniformly bounded away from 0 and $\infty$ on . Let be a Gaussian kernel and be its RKHS; then, for any , there exists a constant such that
for any . Also, for any , let be a random Fourier features approximation of with features and be the RKHS of . Then, with probability at least with respect to a sampling of features,
for any .
We note that the norm relation of the Gaussian RKHS (26) is a known result in [steinwart2009optimal] and our analysis extends this to the combined RKHS .
The proof is based on the following fact:
Let us denote by . First, from [steinwart2009optimal] we have
and there exists a constant such that
where and denote Sobolev and Besov space, respectively, and denotes real interpolation of Banach spaces and (see [steinwart2008support]). Also, by Sobolev’s embedding theorem for Besov space, can be continuously embedded in . Finally, from the condition on , there exists a constant such that
Remark. Although we consider to be Gaussian, the statement itself holds if the probability measure has finite moments of every order (see the proof in Appendix D). In particular, for a shift-invariant kernel , if belongs to the Schwartz class (including the case of a Gaussian kernel), (the Fourier transform of ) also belongs to it, indicating that every moment is finite by the property of the Schwartz class [yoshida1995functional], and the statement of Theorem 2 holds.
4 Main Result
In this section, we show that learning classification problems with SGD and random features achieves exponential convergence of the expected classification error under certain conditions. Before providing our results, several assumptions are imposed on the classification problems. The first is the margin condition on the conditional label probability.
The strong low-noise condition holds:
This condition is commonly used in the theoretical analysis of classification problems [mammen1999smooth, audibert2007fast, koltchinskii2005exponential]. It is the strongest version of the low-noise condition [tsybakov2004optimal, bartlett2006convexity], that is,
for some . This condition is used to derive a faster convergence rate of the empirical risk minimizer than [tsybakov2004optimal, bartlett2006convexity]. A larger means that fewer data points are difficult to predict, and our assumption corresponds to the case of .
The second is the condition on the link function [bartlett2006convexity, zhang2004statistical], which connects the hypothesis space and the probability measure:
Its corresponding value is denoted by :
It is known that is a concave function [zhang2004statistical]. Although may not be uniquely determined nor well-defined in general, the following assumption ensures these properties.
takes values in , -almost surely; is differentiable and is well-defined, differentiable, monotonically increasing, and invertible over . Moreover, it follows that
For logistic loss, , and the above condition is satisfied. Next, following [zhang2004statistical], we introduce Bregman divergence for concave function to ensure the uniqueness of Bayes rule :
Bregman divergence derived by is positive, that is, if and only if . For the expected risk , a unique Bayes rule (up to zero measure sets) exists in .
For logistic loss, it is known that coincides with the Kullback-Leibler divergence, and thus, the positivity of the divergence holds. If is differentiable and is differentiable and invertible, the excess loss function can be expressed using [zhang2004statistical]:
Finally, we introduce the following notation:
Using this notation, Assumption 4 can be reduced to the Bayes rule condition, that is, , -almost surely. For logistic loss, . Under these assumptions and notations, the exponential convergence of the expected classification error is shown.
Consider Algorithm 1 with and where is a positive value such that and Then, with probability , for sufficiently large such that
we have the following inequality for any :
The complete statement and proof are given in Appendix E. We note that although a certain number of features are required to achieve the exponential convergence, the resulting rate does not depend on . In contrast to this, when one considers the convergence rate of the loss function, its rate depends on in general [rudi2017generalization, carratino2018learning, rahimi2009weighted, li2018toward]. From this fact, we can show that random features can save computational cost in a relatively small classification error regime. A detailed discussion is presented below.
As a corollary, we show a simplified result when learning with random Fourier features approximation of a Gaussian kernel and logistic loss, which can be obtained by setting and in Theorem 3 and applying Theorem 2.
Assume is a bounded set and has a density with respect to Lebesgue measure, which is uniformly bounded away from 0 and $\infty$ on .
Let be a Gaussian kernel and be logistic loss. Under Assumptions 1-6, there exists a sufficiently small such that the following statement holds:
Taking a number of random features that satisfies
Consider Algorithm 1 with and where is a positive value such that and Then, with probability , for a sufficiently large such that
we have the following inequality for any :
Computational Viewpoint. As shown in Theorem 3, once a sufficient number of features are sampled, the convergence rate of the excess classification error does not depend on the number of features . This is unexpected because when considering the convergence of the loss function, the approximation error induced by random features usually remains [rudi2017generalization, li2018toward, rahimi2009weighted]. Thus, to obtain the best convergence rate, we have to sample more as the sample size increases.
From this fact, it can be shown that to achieve a relatively small classification error, learning with random features is indeed more computationally efficient than learning with a full kernel method without approximation. As shown in Section 2.4, if one runs SGD in Algorithm 1 with more than iterations, both the time and space computational costs of a full kernel method exceed those of random features. In particular, if one can achieve a classification error such that
then the required number of iterations exceeds the required number of features in Theorem 3, and the overall computational cost becomes larger for a full kernel method. Theoretical results suggesting the efficiency of random features in terms of generalization error have previously been derived only in the regression setting [rudi2017generalization, carratino2018learning]; this is the first time the superiority of random features has been demonstrated in the classification setting. Moreover, this result shows that an arbitrarily small classification error is achievable as long as there is a sufficient number of random features, unlike the regression setting, where the required number of random features depends on the target accuracy.
5 Experiments
In this section, we describe the behavior of SGD with random features on synthetic datasets. We considered logistic loss as a loss function, a Gaussian kernel as an original kernel function, and its random Fourier features as an approximation method. Two-dimensional synthetic datasets were used, as shown in Figure 1. The dataset support is composed of four parts: . For two of them, the conditional probability is , and for the other two, . This distribution satisfies the strong low-noise condition with . For hyper-parameters, we set and . Averaged SGD was run 100 times with 12,000 iterations, and the classification error and loss function were calculated on 100,000 test samples. The average over the runs is reported with standard deviations.
First, the learning curves of the expected classification error and the expected loss function are drawn when the number of features , as shown in Figure 2. Our theoretical result suggests that with sufficient features, the classification error converges exponentially fast, whereas the loss function converges sub-linearly. We can indeed observe a much faster decrease in the classification error (left) than in the loss function (right).
Next, we show the learning curves of the expected classification error when the number of features is varied, as in Figure 3. We can see that the exact convergence of the classification error is not attained with relatively few features such as , which also coincides with our results.
Finally, the convergence of the classification error is compared in terms of computational cost between the random features model with and the full kernel model without approximation. In Figure 4, the learning curves are drawn with respect to the number of parameter updates; the full kernel model requires increasing numbers of updates in later iterations, whereas the random features model requires a constant number of updates. It can be observed that both random features models require fewer parameter updates to achieve the same classification error than the full kernel model for a relatively small classification error. This implies that random features approximation is indeed computationally efficient under a strong low-noise condition.
6 Conclusion
This study shows that learning with SGD and random features can achieve exponential convergence of the classification error under a strong low-noise condition. Unlike the convergence of the loss function, the resulting convergence rate of the classification error is independent of the number of features, indicating that an arbitrarily small classification error is achievable as long as there is a sufficient number of random features. Our results suggest, for the first time, that random features are computationally efficient in theory even for classification problems under certain settings. Our theoretical analysis has been verified by numerical experiments.
One possible future direction is to extend our analysis to general low-noise conditions to derive faster rates than , as in [pillaud2018exponential] in the case of the squared loss. It could also be interesting to explore the convergence speed of more sophisticated variants of SGD, such as stochastic accelerated methods and stochastic variance reduced methods [schmidt2017minimizing, johnson2013accelerating, defazio2014saga, allen2017katyusha].
A Notation and Useful Propositions
Let be a Hilbert space. For , we denote an operator norm of as , that is,
For , we define an outer product as follows:
Let be a closed subspace of , then a projection onto is well defined and we denote its operator by . Then we have
The following inequality shows that the difference between the square roots of two self-adjoint positive semi-definite operators is bounded by the square root of their difference.
Let be a separable Hilbert space. For any compact, positive semi-definite, self-adjoint operators , the following inequality holds:
Let be the eigenvalue with largest absolute value and be the corresponding normalized eigenfunction of , i.e.
Without loss of generality, we can assume that . Since is also positive semi-definite, we have
Thus we have
which completes the proof. ∎
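As a numerical sanity check of the square-root inequality just proved, the following Python sketch tests it on random 2 x 2 positive semi-definite matrices, reading the statement as: the operator norm of the difference of the square roots is at most the square root of the operator norm of the difference. The restriction to 2 x 2 matrices, the closed-form square root, and all function names are our illustrative choices.

```python
import math
import random


def op_norm_sym(m):
    """Operator norm of a symmetric 2x2 matrix: its largest absolute eigenvalue."""
    (a, b), (_, d) = m
    return abs(a + d) / 2.0 + math.hypot((a - d) / 2.0, b)


def sqrt_psd(m):
    """Square root of a 2x2 PSD matrix A via sqrt(A) = (A + s I) / t,
    where s = sqrt(det A) and t = sqrt(trace A + 2 s)."""
    (a, b), (_, d) = m
    s = math.sqrt(max(a * d - b * b, 0.0))
    t = math.sqrt(a + d + 2.0 * s)
    if t == 0.0:  # A is the zero matrix
        return ((0.0, 0.0), (0.0, 0.0))
    return (((a + s) / t, b / t), (b / t, (d + s) / t))


def random_psd(rng):
    """Random PSD matrix G G^T with standard Gaussian entries in G."""
    g11, g12, g21, g22 = (rng.gauss(0.0, 1.0) for _ in range(4))
    off = g11 * g21 + g12 * g22
    return ((g11 * g11 + g12 * g12, off), (off, g21 * g21 + g22 * g22))


def subtract(p, q):
    """Entrywise difference of two 2x2 matrices."""
    return tuple(tuple(x - y for x, y in zip(rp, rq)) for rp, rq in zip(p, q))


def worst_violation(num_trials=1000, seed=0):
    """Largest value of ||sqrt(A) - sqrt(B)|| - sqrt(||A - B||) over random pairs."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(num_trials):
        mat_a, mat_b = random_psd(rng), random_psd(rng)
        lhs = op_norm_sym(subtract(sqrt_psd(mat_a), sqrt_psd(mat_b)))
        rhs = math.sqrt(op_norm_sym(subtract(mat_a, mat_b)))
        worst = max(worst, lhs - rhs)
    return worst
```

If the inequality holds, worst_violation() should never be positive beyond floating-point rounding.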
The following inequality is from Proposition 3 in [rudi2017generalization]. It is a generalization of the Bernstein inequality to random operators on a separable Hilbert space and is used in Lemma 1 to derive the concentration of integral operators.
Proposition 2 (Bernstein’s inequality for sum of random operators).
Let be a separable Hilbert space and let be a sequence of independent and identically distributed self-adjoint random operators on . Assume that and there exists such that almost surely for any . Let S be the positive operator such that . Then for any , the following inequality holds with probability at least :
Restatement of Proposition 3 in [rudi2017generalization]. ∎
B Basic Properties of RKHS
In analyses of kernel methods, it is common to assume that is compact, has full support, and is continuous, because under such assumptions Mercer's theorem can be used to characterize the RKHS [cucker2002mathematical, aronszajn1950theory]. However, such assumptions may not be valid under the strong low-noise condition, in which may not have full support. In this section, we explain some basic properties of reproducing kernel Hilbert spaces (RKHSs) under more general settings based on [dieuleveut2016nonparametric, steinwart2012mercer].
First, for given kernel function and its RKHS , we define a covariance operator as follows:
It is well-defined through Riesz’ representation theorem. Using reproducing property, we have
where the expectation is defined via Bochner integration. From the representation (63), we can extend the covariance operator to . We denote this by as follows:
Following [dieuleveut2016nonparametric], here we denote the set of square-integrable functions itself by , that is, its quotient is , which is a separable Hilbert space. We can also define the extended covariance operator as follows:
Here we present some properties of these covariance operators from [dieuleveut2016nonparametric].
is self-adjoint, continuous operator and
is continuous, self-adjoint, positive semi-definite operator.
is well-defined and an isometry. In particular, for any , there exists such that
We denote the extended covariance operators associated with by and .
As with (65), we have
The next lemma provides a probabilistic bound on the difference between the two covariance operators and .
For any the following inequality holds with probability at least :
Let Then Also, we have